# 🐦 Audio 101. 1- Audio manipulation & musical notes

## [BirdCLEF 2022](https://www.kaggle.com/c/birdclef-2022)
### Identify bird calls in soundscapes
![](https://storage.googleapis.com/kaggle-competitions/kaggle/33246/logos/header.png)


## Hi and welcome! This is the first kernel of the series `Audio 101`, the documentation of my learning process in the amazing world of audio processing.

**In this short kernel we will go over some very basic skills for working with audio. We will load the `.ogg` files with `torchaudio`, play them with `IPython.display.Audio` and get familiar with manipulating waveforms. We will build a simple musical note generator and play a musical scale!**

This series aims to build a good understanding of each topic from scratch.

The ideal reader is a Data Scientist noob with some general knowledge about Deep Learning, but no technical expertise in Audio Processing. 

---

The full series consists of the following notebooks:
1. _[🐦 Audio 101. 1-Audio manipulation & musical notes](https://www.kaggle.com/julian3833/audio-101-1-audio-manipulation-musical-notes/) (This notebook)_
2. [🐦 Audio 101. 2- Detailed EDA](https://www.kaggle.com/julian3833/audio-101-2-detailed-eda/) 



This is an ongoing project, so expect more notebooks to be added to the series soon. Actually, we are currently working on the following ones:
* **Plot Fourier Transforms and Spectrograms**
* **Build a simple CNN classifier model over image features**
* **Study the previous competition [BirdCLEF 2021 - Birdcall Identification](https://www.kaggle.com/c/birdclef-2021) and migrate some good models**

---

# Intro


I started learning speech recognition a few days ago. Lucky me, this competition started! An ongoing competition is always very engaging, and engagement helps with the learning process.


This is not an EDA of the competition itself but a small exploration of very simple concepts about handling audio data. We load, explore and analyze a file and, after discovering that an audio clip is just a 1D `np.array`, we build a nice musical note generator function using `np.sin`. Finally, we create and play a musical scale.

In case you are just learning audio processing with DL as I am, you can check the notebooks of my repo [speech-101](https://github.com/dataista0/speech-101/nbs). The first one has a good collection of resources to start diving into the "Speech" world (with previous Kaggle competitions on Audio data). The second one builds a neat speech recognizer that triggers commands on demand using `speechrecognition`, following a YouTube tutorial:
* [Day 1 - NB1 - Googling and finding relevant root resources.ipynb](https://github.com/dataista0/speech-101/blob/main/nbs/Day%201%20-%20NB1%20-%20Googling%20and%20finding%20relevant%20root%20resources.ipynb)
* [Day 1 - NB2 - Simple speech recognizer with speechrecognition.ipynb](https://github.com/dataista0/speech-101/blob/main/nbs/Day%201%20-%20NB2%20-%20Simple%20speech%20recognizer%20with%20speechrecognition.ipynb)

You might find these resources that I gathered on Speech Recognition valuable as well:


## Theoretical introductions: 
* <a href="https://www.youtube.com/watch?v=dBAn67ZKbZ4" >Introduction to Deep Learning for Audio and Speech Applications - YouTube</a>
* <a href="https://www.youtube.com/watch?v=RBgfLvAOrss">Stanford Seminar - Deep Learning in Speech Recognition - YouTube</a>
* <a href="https://en.wikipedia.org/wiki/Speech_recognition">Speech Recognition - Wikipedia</a>
* <a href="https://en.wikipedia.org/wiki/Speech_processing">Speech Processing - Wikipedia</a>


## Kaggle speech competitions

* <a href="https://www.kaggle.com/c/tensorflow-speech-recognition-challenge/code?competitionId=7634&amp;sortBy=voteCount">TensorFlow Speech Recognition Challenge | Kaggle</a>
* <a href="https://www.kaggle.com/c/rfcx-species-audio-detection/overview" >Rainforest Connection Species Audio Detection | Kaggle</a>
* <a href="https://www.kaggle.com/c/birdclef-2021/code?competitionId=25954&amp;sortBy=voteCount" >BirdCLEF 2021 - Birdcall Identification | Kaggle</a>


## Other Kaggle resources
* <a href="https://www.kaggle.com/davids1992/speech-representation-and-data-exploration" >Speech representation and data exploration | Kaggle</a>
* <a href="https://www.kaggle.com/alexozerin/end-to-end-baseline-tf-estimator-lb-0-72">End-to-end baseline TF Estimator LB 0.72 | Kaggle</a>
* <a href="https://www.kaggle.com/nandhuelan/wav2vec-wandb-learning-audio-representation/data">Wav2vec+wandb- Learning audio representation 🔥🤗 | Kaggle</a>


## Datasets, models and papers
* <a href="https://ai.googleblog.com/2017/08/launching-speech-commands-dataset.html" >Google AI Blog: Launching the Speech Commands Dataset</a>
* <a href="https://huggingface.co/anton-l/wav2vec2-random-tiny-classifier" >anton-l/wav2vec2-random-tiny-classifier · Hugging Face</a>
* <a href="https://huggingface.co/docs/transformers/model_doc/wav2vec2">Wav2Vec2</a>
* <a href="https://arxiv.org/abs/2006.11477" >[2006.11477] wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations</a>


## More resources

* <a href="https://www.youtube.com/watch?v=Qf4YJcHXtcY">13. Speech Recognition with Convolutional Neural Networks in Keras/TensorFlow</a>
* <a href="https://www.youtube.com/watch?v=9GJ6XeB-vMg">Speech Recognition in Python</a>
* <a href="https://www.youtube.com/watch?v=qV4lR9EWGlY">Sound: Crash Course Physics #18</a>
* <a href="https://www.youtube.com/watch?v=PWVH3Vx3dCI">Speech Recognition Using Python | How Speech Recognition Works In Python | Simplilearn</a>
* <a href="https://www.youtube.com/watch?v=spUNpyF58BY">But what is the Fourier Transform? A visual introduction. 3blue 1 brown YT</a>


---



#  Please _DO_ upvote if you found this useful or interesting!


Enough chitchat, let's code!

# Load a file with `torchaudio`

In [None]:
import torchaudio
import pandas as pd

a_file = "../input/birdclef-2022/train_audio/akekee/XC174953.ogg"

In [None]:
# It seems it's a common practice in the audio world to return this tuple of (waveform, sample_rate)
waveform, sample_rate = torchaudio.load(a_file)

# It returns a tensor. Let's go to numpy
waveform = waveform.numpy()

# Samples per second
print(sample_rate)

# See:
* This awesome EDA: [🐦 BirdCLEF 2022: EDA 🐦](https://www.kaggle.com/prokaggler/birdclef-2022-eda)
* The documentation of [torchaudio](https://pytorch.org/audio/stable/torchaudio.html)

In [None]:
# The shape is (channels, samples)
print(waveform.shape)

In [None]:
# Let's drop the first dimension and get into pandas world
waveform = pd.Series(waveform[0])
waveform.shape
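
Taking `waveform[0]` keeps only the first channel. If a file had more than one channel, one common option would be to average them into a single 1D array. A minimal sketch on a synthetic array (the `to_mono` helper is hypothetical, not part of `torchaudio`):

```python
import numpy as np

def to_mono(waveform: np.ndarray) -> np.ndarray:
    """Average a (channels, samples) array into a 1D (samples,) array.

    Hypothetical helper for illustration; a 1D input is returned unchanged.
    """
    if waveform.ndim == 1:
        return waveform
    return waveform.mean(axis=0)

# Fake "stereo" clip: left channel all ones, right channel all zeros
stereo = np.stack([np.ones(5), np.zeros(5)])
mono = to_mono(stereo)
print(stereo.shape, mono.shape)
```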

In [None]:
# Since we have a sample rate of 32000 and 76904 samples, this means that the audio lasts for 2.40325 seconds
seconds = waveform.shape[0] / sample_rate
seconds

In [None]:
# A sound is just a one-dimensional array!
waveform.plot(figsize=(20, 5), alpha=0.5, color='red', title="First example");

In [None]:
# It looks "dense" because it oscillates a great deal and is very compressed, but it's just a simple line
waveform[10000:10200].plot(figsize=(20, 5), alpha=0.5, color='red', title="First example - Zoom-in on 200 samples");

In [None]:
# Since the sample rate is 32K, 200 samples is a sound that lasts 6.25 milliseconds
200 / sample_rate
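
The conversion between samples and time shows up all the time, so it may be worth wrapping it up. A small sketch using the numbers from this clip (the two helper names are my own, nothing standard):

```python
def samples_to_seconds(n_samples: int, sample_rate: int) -> float:
    """Duration in seconds of `n_samples` samples at `sample_rate` samples per second."""
    return n_samples / sample_rate

def seconds_to_samples(seconds: float, sample_rate: int) -> int:
    """Number of samples needed to cover `seconds` at `sample_rate` (rounded)."""
    return int(round(seconds * sample_rate))

clip_seconds = samples_to_seconds(76904, 32000)   # the whole clip
zoom_ms = samples_to_seconds(200, 32000) * 1000   # the zoomed-in slice, in milliseconds
print(clip_seconds, zoom_ms)
```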

## Using `torchaudio.info`

In [None]:
info = torchaudio.info(a_file)
[attr for attr in dir(info) if not attr.startswith("_")]

In [None]:
for attr in dir(info):
    if not attr.startswith("_"):
        print(f"{attr:<16}= {getattr(info, attr)}")

# Play the sound!

In [None]:
from IPython.display import Audio

In [None]:
# ~2.4 seconds
Audio(waveform, rate=sample_rate)

### I already knew that sound was a 1D array, but now that I see it as an `np.array` it feels great. It seems easy to manipulate as well...

Let's try some stuff

In [None]:
import numpy as np

random_sound = np.random.rand(sample_rate * 2) / 10 # 2 seconds 
pd.Series(random_sound).plot();
Audio(random_sound, rate=sample_rate)

In [None]:
square_sound = np.ones(sample_rate * 2) / 10
square_sound[16000:32000] = 0
square_sound[32000+16000:] = 0
display(pd.Series(square_sound).plot())
Audio(square_sound, rate=sample_rate)
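
The array above switches a constant level on and off only twice, so it will likely sound like a few clicks rather than a tone. To hear a square wave as a pitch, it has to flip many times per second. A minimal sketch, assuming the same 32000 Hz sample rate and an arbitrary 220 Hz frequency:

```python
import numpy as np

sample_rate = 32000  # assumed: same rate as the clip above
duration = 2         # seconds
frequency = 220      # Hz, arbitrary choice in the audible range

# Time of each sample in seconds
t = np.arange(int(sample_rate * duration)) / sample_rate

# The sign of a sine wave flips between +1 and -1 once per half cycle
square_wave = 0.1 * np.sign(np.sin(2 * np.pi * frequency * t))

# In a notebook: Audio(square_wave, rate=sample_rate)
```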

### Let's play a sine wave to "synthesize" sound

This is how a bare musical note can be created from the math world.

In [None]:
import matplotlib.pyplot as plt
plt.rcParams["figure.figsize"] = (20,5)

In [None]:
# One sin cycle happens between 0 and 2 pi
x = np.linspace(0, 2*np.pi, 10000)
sin = np.sin(x)
plt.plot(x, sin);
# Cannot hear it. What about you? :P
Audio(sin, rate=sample_rate)

In [None]:
# Increase cycle frequency
x = np.linspace(0, 2*np.pi, sample_rate)
sin = np.sin(10*x)
plt.plot(x, sin);
# Still cannot hear it: 10 cycles per second (10 Hz) is below the human hearing range
Audio(sin, rate=sample_rate)

The low-pitch/high-pitch spectrum in which musical notes live is related to the frequency of the wave. The frequency answers the question: how many cycles fit in one second?

440 Hz means there are 440 cycles in one second.
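
To make "cycles per second" concrete, here is a small sketch that builds exactly one second of a 440 Hz sine and counts its cycles. Each cycle crosses zero going downward once, so counting those crossings counts the cycles. The 32000 Hz sample rate is the one from the clip above; the counting approach is just my own illustration:

```python
import numpy as np

sample_rate = 32000
frequency = 440

# Exactly one second of audio: sample n happens at time n / sample_rate
t = np.arange(sample_rate) / sample_rate
wave = np.sin(2 * np.pi * frequency * t)

# A downward zero crossing is a positive sample followed by a non-positive one
downward_crossings = int(np.sum((wave[:-1] > 0) & (wave[1:] <= 0)))

# How many samples are spent on each cycle
samples_per_cycle = sample_rate / frequency

print(downward_crossings, samples_per_cycle)
```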

![](https://149695847.v2.pressablecdn.com/wp-content/uploads/2021/07/image-293.png)

Let's create musical notes with numpy!

**References and useful resources**
* [A (musical note)][1]
* [E (musical note)][2]
* [Online tone generator](https://www.szynalski.com/tone-generator/) - This is awesome but be careful with high-pitched tones.

[1]: https://en.wikipedia.org/wiki/A_(musical_note)
[2]: https://en.wikipedia.org/wiki/E_(musical_note)

In [None]:
# Increase cycle frequency to 440 Hz, which is the musical note A
x = np.linspace(0, 2*np.pi, 32000)
sin = np.sin(440*x)
plt.plot(x, sin);
# Play it
Audio(sin, rate=sample_rate)

# Musical notes generator

We can add duration and volume and we have a musical note generator :D :D

The volume is the amplitude of the wave, so it's a multiplier of the `sin` output.
The duration of the sound, on the other hand, affects both the `domain` space and the number of samples, so it scales both the end of the `linspace` range and its number of points.
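
One caveat, as far as I can tell from the IPython docs: `IPython.display.Audio` rescales array input to the full range by default (`normalize=True`), so a single note may play equally loud whatever its amplitude. The amplitude still matters once notes are mixed or concatenated, since they are rescaled together. A plain numpy sketch of that rescaling (the `peak_normalize` helper is hypothetical, written only to illustrate the effect):

```python
import numpy as np

def peak_normalize(signal: np.ndarray) -> np.ndarray:
    """Rescale so the largest absolute sample becomes 1 (hypothetical helper)."""
    peak = np.max(np.abs(signal))
    return signal if peak == 0 else signal / peak

t = np.arange(32000) / 32000  # one second at 32000 Hz
quiet = 0.1 * np.sin(2 * np.pi * 440 * t)
loud = 1.0 * np.sin(2 * np.pi * 440 * t)

# After peak normalization the two versions are indistinguishable
same_after_normalizing = np.allclose(peak_normalize(quiet), peak_normalize(loud))
print(same_after_normalizing)
```

Passing `normalize=False` to `Audio` should keep the original amplitudes, provided all values stay within [-1, 1].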

In [None]:
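def get_note(frequency=440, volume=1, duration=2, sample_rate=32000, plot_wave=False, display_audio=False):
    """Return a sine wave of `frequency` Hz lasting `duration` seconds as a 1D np.array."""
    # x spans `duration` full turns of 2*pi, with one sample every 1/sample_rate seconds.
    # endpoint=False keeps the sample spacing exact, so the pitch is not slightly off.
    x = np.linspace(0, 2*np.pi*duration, int(sample_rate*duration), endpoint=False)
    sin = volume * np.sin(frequency * x)

    if plot_wave:
        plt.plot(x, sin)
        plt.show()

    if display_audio:
        display(Audio(sin, rate=sample_rate))
    return sin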

In [None]:
# This is an A
get_note(frequency=440, display_audio=True);

In [None]:
# This is an E
get_note(frequency=329.63, display_audio=True);
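
The 329.63 Hz used for E is consistent with twelve-tone equal temperament, where A4 is 440 Hz and each semitone multiplies the frequency by 2^(1/12). E4 sits 5 semitones below A4. A small sketch of that formula (the `note_frequency` name is my own):

```python
def note_frequency(semitones_from_a4: int, a4: float = 440.0) -> float:
    """Equal-temperament frequency of the note `semitones_from_a4` semitones away from A4."""
    return a4 * 2 ** (semitones_from_a4 / 12)

e4 = note_frequency(-5)   # E4: five semitones below A4
c5 = note_frequency(3)    # C5: three semitones above A4
a5 = note_frequency(12)   # one octave up doubles the frequency

print(round(e4, 2), round(c5, 2), a5)
```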

# Let's play a musical scale now

In [None]:
# Get a table of note frequencies from a web page and keep the natural notes from C4 to C5 (the C major scale)
def get_scale_notes():
    df_notes = pd.read_html("https://pages.mtu.edu/~suits/notefreqs.html")[1]
    df_notes.columns = ['note', 'frequency', 'waveform']
    df_notes = df_notes[['note', 'frequency']]
    mask = (df_notes['note'].str.contains("4") & ~df_notes['note'].str.contains("#")) | (df_notes['note'] == 'C5')
    return df_notes[mask]

scale = get_scale_notes()
scale

In [None]:
scale_sound = np.concatenate([get_note(frequency=frequency, duration=0.3) for frequency in scale['frequency'].tolist()])

# Up!
Audio(scale_sound, rate=sample_rate)

In [None]:
# And down!
Audio(scale_sound[::-1], rate=sample_rate)
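
Reversing the sample array works here because a reversed sine still sounds like the same pure tone. For sounds that are not symmetric in time (a bird call, say), reversing the samples would play them backwards, so a more general way to go down is to reverse the order of the notes. A self-contained sketch: the local `tone` helper and the hard-coded C major frequencies stand in for `get_note` and the scraped table, and both are assumptions made so the block runs on its own:

```python
import numpy as np

sample_rate = 32000

# Equal-tempered frequencies (Hz) for C4, D4, E4, F4, G4, A4, B4, C5
c_major = [261.63, 293.66, 329.63, 349.23, 392.00, 440.00, 493.88, 523.25]

def tone(frequency, duration=0.3):
    """Sine wave of `frequency` Hz lasting `duration` seconds (stand-in for get_note)."""
    t = np.arange(int(sample_rate * duration)) / sample_rate
    return np.sin(2 * np.pi * frequency * t)

# Going up: notes in table order. Going down: same notes, reversed order.
scale_up = np.concatenate([tone(f) for f in c_major])
scale_down = np.concatenate([tone(f) for f in reversed(c_major)])

# In a notebook: Audio(scale_down, rate=sample_rate)
```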


# What's next?
You can check the next notebook of the series, [🐦 Audio 101. 2- Detailed EDA](https://www.kaggle.com/julian3833/audio-101-2-detailed-eda/), in which we will do some EDA of the competition.