In [1]:
# -*- coding: utf-8 -*-
# author: PythonIT

import numpy as np
from pydub import AudioSegment
import ByteConversion
import binascii

## About this

The goal of this notebook is to help understand **WAVE** files. I wrote it mainly for myself, to visualize and explore, so there is no guarantee that everything is correct. I use `pydub` to make the sound playable in the notebook; the examples themselves are built with plain Python plus `numpy`.

# Understanding WAVE PCM files

**WAVE** is a file format for storing audio waveform data. It is an application of Microsoft's **RIFF** container format, which stores data in chunks. A RIFF file begins with a header that identifies the file, followed by chunks that describe and hold the data. In a PCM WAVE file, the data chunk holds samples of the audio signal converted to digital values using PCM (pulse code modulation), a way of storing an analog signal digitally by sampling it at constant intervals.

## WAVE file breakdown

A **WAVE** file can be split into 3 important parts: the **RIFF chunk**, the **Format chunk** and the **Data chunk**. The first describes the file, the second the format used, and the third contains the actual data.

## Loading a WAVE file

To understand WAVE files, it is best to load one and see for yourself.

In [2]:
# loading the sound file via pydub to play it in the notebook
test_sound_playable = AudioSegment.from_file('test.wav', format='wav')
test_sound_playable

In [3]:
# loading the sound file in binary mode; the with-block closes the file automatically
with open('test.wav', 'rb') as sound_file:
    test_sound = sound_file.read()

## The RIFF chunk

The **RIFF chunk** is the header to the file. It describes the file as a RIFF file and specifies further formats as well as the chunk size. It is 12 bytes in size.

In [4]:
# Reading the first bytes in hexadecimal
riff_chunk = ByteConversion.h2h(test_sound[:12])
riff_chunk

'52 49 46 46 fc 40 0a 00 57 41 56 45'

In [5]:
# Converting them by ASCII table
ByteConversion.h2s(riff_chunk)

'R I F F ü @ \n \x00 W A V E'

The first 12 bytes in the file look like this: `'52 49 46 46 fc 40 0a 00 57 41 56 45'`. These 12 bytes are split into 3 fields.
 - The first 4 bytes (`52 49 46 46`) - **ChunkID** - contain the letters `'R I F F'`, marking the file as a RIFF file.
 - The following 4 bytes (`fc 40 0a 00`) - **ChunkSize** - give the size of the file excluding *ChunkID* and *ChunkSize*. Note that the bytes are stored in little-endian order (least significant byte first): (00 0a 40 fc)<sub>16</sub> = (671996)<sub>10</sub>. This is exactly 8 bytes less than the size of the whole file.
 - The last 4 bytes (`57 41 56 45`) - **Format** - specify the file format (`'W A V E'`). Depending on the file format, the following chunks and sub-chunks could look completely different.
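The three fields can also be decoded with the standard library's `struct` module instead of by hand. This is only a sketch: the header bytes are copied from the output above rather than read from `test.wav`, and the variable names are my own.

```python
import struct

# The 12 header bytes shown above, copied from the notebook output.
riff_header = bytes.fromhex('52 49 46 46 fc 40 0a 00 57 41 56 45')

# '<' = little-endian, '4s' = 4 raw bytes, 'I' = unsigned 32-bit integer
chunk_id, chunk_size, riff_format = struct.unpack('<4sI4s', riff_header)
print(chunk_id, chunk_size, riff_format)

# The whole file is ChunkSize plus the 8 bytes of ChunkID and ChunkSize.
file_size = chunk_size + 8
print(file_size)
```

The `<` prefix is what handles the byte order, so no manual reversing is needed.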

## The FORMAT chunk

The **Format chunk** describes the format in more detail. For **WAVE** it stores properties of the audio such as the *sample rate*, *byte rate* and *number of channels*. Players use these to set up playback and to read the data chunk correctly. For PCM the format chunk is usually 24 bytes long (an 8-byte chunk header plus a 16-byte body), but it can vary. The actual size of the body is stored in the 4 bytes after the chunk ID.

In [6]:
# Reading the format chunk header
frmt_chunk_header = ByteConversion.h2h(test_sound[12:20]) # 8 bytes after the RIFF chunk
frmt_chunk_header

'66 6d 74 20 10 00 00 00'

In [7]:
# convert to ASCII
ByteConversion.h2s(frmt_chunk_header)

'f m t SPACE \x10 \x00 \x00 \x00'

The format chunk begins with the keyword `fmt ` (`66 6d 74 20`, note the trailing space), the **Subchunk 1 ID**. The following 4 bytes, **Subchunk 1 Size**, store the size of the rest of the format chunk (`10 00 00 00`<sub>16</sub> = `16`<sub>10</sub>, the usual value for PCM).

In [8]:
# reading the next 16 bytes
frmt_chunk_body = ByteConversion.h2h(test_sound[20:36])
frmt_chunk_body

'01 00 02 00 80 bb 00 00 00 ee 02 00 04 00 10 00'

The body of the format chunk stores the audio parameters:
 - 2 bytes (`01 00`<sub>16</sub> = `1`<sub>10</sub>) - **AudioFormat**: 1 = uncompressed PCM; other values indicate other encodings, some of them compressed
 - 2 bytes (`02 00`<sub>16</sub> = `2`<sub>10</sub>) - **NumChannels**: 1 = Mono, 2 = Stereo, ... - the number of audio channels
 - 4 bytes (`80 bb 00 00`<sub>16</sub> = `48000`<sub>10</sub>) - **SampleRate**: samples / sec
 - 4 bytes (`00 ee 02 00`<sub>16</sub> = `192000`<sub>10</sub>) - **ByteRate**: bytes / sec, indicates how fast data must be read. Equivalent to: *SampleRate* * *NumChannels* * *BitsPerSample* / 8
 - 2 bytes (`04 00`<sub>16</sub> = `4`<sub>10</sub>) - **BlockAlign**: number of bytes per sample block (one sample for each channel). Equivalent to: *NumChannels* * *BitsPerSample* / 8
 - 2 bytes (`10 00`<sub>16</sub> = `16`<sub>10</sub>) - **BitsPerSample**: size of one sample of one channel
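As a cross-check of the list above, the 16 body bytes can be unpacked in one call and the two derived fields recomputed from the others. This is a sketch using the bytes from the output above; the variable names are my own.

```python
import struct

# The 16 body bytes of the format chunk, copied from the notebook output.
fmt_body = bytes.fromhex('01 00 02 00 80 bb 00 00 00 ee 02 00 04 00 10 00')

# 'H' = unsigned 16-bit, 'I' = unsigned 32-bit, '<' = little-endian
(audio_format, num_channels, sample_rate,
 byte_rate, block_align, bits_per_sample) = struct.unpack('<HHIIHH', fmt_body)

# ByteRate and BlockAlign are redundant: both follow from the other fields.
expected_byte_rate = sample_rate * num_channels * bits_per_sample // 8
expected_block_align = num_channels * bits_per_sample // 8

print(audio_format, num_channels, sample_rate, bits_per_sample)
print(byte_rate == expected_byte_rate, block_align == expected_block_align)
```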

## The DATA chunk

The data chunk now stores the actual audio data. The size of this sub-chunk is also defined in its chunk header.

In [9]:
# reading chunk header
data_chunk_header = ByteConversion.h2h(test_sound[36:44])
data_chunk_header

'64 61 74 61 d8 40 0a 00'

In [10]:
# convert to ASCII
ByteConversion.h2s(data_chunk_header)

'd a t a Ø @ \n \x00'

The data chunk begins with the keyword `data` (`64 61 74 61`), the **Subchunk 2 ID**, followed by the size of the sub-chunk, **Subchunk 2 Size**, in the next 4 bytes (`d8 40 0a 00`<sub>16</sub> = `671960`<sub>10</sub>). This is the *ChunkSize* from the RIFF chunk minus the 36 header bytes that precede the data.
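Because every chunk announces its own size, a file can be walked chunk by chunk without hard-coding offsets such as 20, 36 or 44. The following is a minimal sketch of that idea; `iter_chunks` is a hypothetical helper of my own, and the example bytes are hand-made rather than taken from `test.wav`.

```python
import struct

def iter_chunks(riff_bytes):
    """Yield (chunk_id, body_offset, body_size) for each chunk after the RIFF header."""
    offset = 12  # skip ChunkID, ChunkSize and Format
    while offset + 8 <= len(riff_bytes):
        chunk_id, size = struct.unpack_from('<4sI', riff_bytes, offset)
        yield chunk_id, offset + 8, size
        # RIFF pads chunk bodies to an even number of bytes
        offset += 8 + size + (size & 1)

# Tiny hand-made example: a 16-byte fmt body followed by 4 bytes of audio data
fmt_body = struct.pack('<HHIIHH', 1, 2, 48000, 192000, 4, 16)
data_body = b'\xd6\xff\xd6\xff'
example = (b'RIFF'
           + struct.pack('<I', 4 + 8 + len(fmt_body) + 8 + len(data_body))
           + b'WAVE'
           + b'fmt ' + struct.pack('<I', len(fmt_body)) + fmt_body
           + b'data' + struct.pack('<I', len(data_body)) + data_body)

chunks = list(iter_chunks(example))
for chunk_id, body_offset, body_size in chunks:
    print(chunk_id, body_offset, body_size)
```

Walking the chunks this way also copes with files that carry extra chunks between `fmt ` and `data`, where the fixed offsets used in this notebook would no longer apply.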

In [11]:
# reading the first 2 sample blocks (BlockAlign * 2 = 8 bytes) of the data
data_chunk_first2 = ByteConversion.h2h(test_sound[44:52])
data_chunk_first2

'd6 ff d6 ff d2 ff d2 ff'

One sample block (frame) is *BlockAlign* bytes long. In our case these 4 bytes are divided between 2 channels, since the file is stereo. One channel sample is *BitsPerSample* bits long (16 bits = 2 bytes). 16-bit PCM samples are stored as signed little-endian integers, so the left channel of the first block is `d6 ff`<sub>16</sub> = `-42`<sub>10</sub> (`65494`<sub>10</sub> if read as unsigned), which happens to be the same as the right channel. The player reconstructs the audio signal from these values.
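The 8 bytes above can be decoded into per-channel sample values with `struct`. In WAVE files 16-bit PCM samples are signed two's-complement integers, so the sketch below shows both readings of the same bytes; the variable names are my own.

```python
import struct

# The first two sample blocks, copied from the notebook output.
first_two_blocks = bytes.fromhex('d6 ff d6 ff d2 ff d2 ff')

# 'H' reads unsigned 16-bit values, 'h' reads signed 16-bit values
as_unsigned = struct.unpack('<4H', first_two_blocks)
as_signed = struct.unpack('<4h', first_two_blocks)

# Group the signed values into (left, right) pairs, one pair per block
frames = [as_signed[i:i + 2] for i in range(0, len(as_signed), 2)]

print(as_unsigned)
print(frames)
```

The signed values are small negative numbers close to zero, which is what one would expect from the quiet start of a recording.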

# Creating our own WAVE file

In order to create our own file we have to get ourselves some audio data. The simplest approach is to just use some sine wave. We can pretty easily generate wave data with `numpy`.

In [12]:
# Set constants
AUDIO_FORMAT = 1 # PCM
NUM_CHANNELS = 2 # Stereo
SAMPLE_RATE = 48000 # Common rates are 44100 or 48000 Hz
SAMPLE_SIZE = 16 # Bits/sample

In [13]:
# Determine FORMAT values
byte_rate = SAMPLE_RATE * NUM_CHANNELS * SAMPLE_SIZE // 8  # bytes per second
block_align = NUM_CHANNELS * SAMPLE_SIZE // 8  # bytes per sample block

## Create FORMAT Chunk

In [17]:
# creating bytestrings and concat them
frmt_head_string = b'fmt '
frmt_head_size_string = (16).to_bytes(4, byteorder='little')
audio_format_string = (AUDIO_FORMAT).to_bytes(2, byteorder='little')
num_channels_string = (NUM_CHANNELS).to_bytes(2, byteorder='little')
sample_rate_string = (SAMPLE_RATE).to_bytes(4, byteorder='little')
byte_rate_string = (byte_rate).to_bytes(4, byteorder='little')
block_align_string = (block_align).to_bytes(2, byteorder='little')
bits_per_sample_string = (SAMPLE_SIZE).to_bytes(2, byteorder='little')

frmt_head = frmt_head_string + frmt_head_size_string
frmt_body = audio_format_string + num_channels_string + sample_rate_string + byte_rate_string + block_align_string + bits_per_sample_string

frmt_chunk_string = frmt_head + frmt_body

# comparing this to the head and body of the original file:
print('New:     ', frmt_chunk_string)
print('Original:', test_sound[12:36])

New:      b'fmt \x10\x00\x00\x00\x01\x00\x02\x00\x80\xbb\x00\x00\x00\xee\x02\x00\x04\x00\x10\x00'
Original: b'fmt \x10\x00\x00\x00\x01\x00\x02\x00\x80\xbb\x00\x00\x00\xee\x02\x00\x04\x00\x10\x00'


## Create DATA Chunk

I want to create `TIME` seconds of a `FREQ` Hz sine wave (2 seconds at 550 Hz below). This requires `byte_rate * TIME` bytes of data.

In [59]:
# save data size
TIME = 2
DATA_SIZE = int(byte_rate * TIME)
FREQ = 550

In [60]:
# creating header bytestrings
data_head_string = b'data'
data_head_size_string = (DATA_SIZE).to_bytes(4, byteorder='little')

data_head_string = data_head_string + data_head_size_string

### Creating data

Numpy helps by providing an array of sine values.

In [71]:
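NUM_SAMPLES = int(SAMPLE_RATE * TIME)
# 16-bit PCM samples are signed, so the peak value is 2**15 - 1 = 32767
AMPLITUDE = 2 ** (SAMPLE_SIZE - 1) - 1

samples = np.arange(NUM_SAMPLES)

# sine wave centered on zero, scaled to the full signed 16-bit range
signal = np.sin(2 * np.pi * FREQ * samples / SAMPLE_RATE) * AMPLITUDE


data_string = b''

for sample in signal:
    # one little-endian signed 16-bit value per sample
    string = int(sample).to_bytes(2, byteorder='little', signed=True)
    # the same value is written once per channel
    data_string += NUM_CHANNELS * string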
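As an alternative to the Python loop, `numpy` can do the conversion and the channel interleaving in one go. This is a sketch under my own assumptions: the constants are repeated so the block runs on its own, samples are signed 16-bit, and values are rounded rather than truncated, so individual samples may differ by 1 from the loop's output.

```python
import numpy as np

SAMPLE_RATE = 48000
NUM_CHANNELS = 2
TIME = 2
FREQ = 550

# time axis in seconds, one entry per sample
t = np.arange(SAMPLE_RATE * TIME) / SAMPLE_RATE

# '<i2' = little-endian signed 16-bit integers, peak value 32767
mono = np.round(np.sin(2 * np.pi * FREQ * t) * 32767).astype('<i2')

# shape (frames, channels); row-major order interleaves the channels per frame
interleaved = np.repeat(mono[:, np.newaxis], NUM_CHANNELS, axis=1)
pcm_bytes = interleaved.tobytes()

print(interleaved.shape, len(pcm_bytes))
```

For two seconds of audio the difference in run time is small, but repeated `bytes` concatenation gets slow for longer signals.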

In [72]:
data_chunk_string = data_head_string + data_string

## Create HEAD chunk

In [73]:
head_id_string = b'RIFF'
head_chunk_size_string = (36 + DATA_SIZE).to_bytes(4, byteorder='little')
head_format_string = b'WAVE'

head_string = head_id_string + head_chunk_size_string + head_format_string

## Bringing it all together

In [75]:
file_bytestring = head_string + frmt_chunk_string + data_chunk_string

# the with-block closes the file automatically
with open('out.wav', 'wb') as out_file:
    out_file.write(file_bytestring)
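One way to check a hand-built file is to let the standard library's `wave` module parse it. The sketch below does not read `out.wav`; it assembles a tiny 44-byte-header file in memory with the same layout and checks that `wave` reports the expected parameters. The sample values are arbitrary.

```python
import io
import struct
import wave

# Two stereo frames of signed 16-bit samples
pcm = struct.pack('<4h', -42, -42, -46, -46)

# RIFF chunk, fmt chunk and data chunk header in one call (44 bytes)
header = struct.pack(
    '<4sI4s4sIHHIIHH4sI',
    b'RIFF', 36 + len(pcm), b'WAVE',              # RIFF chunk
    b'fmt ', 16, 1, 2, 48000, 192000, 4, 16,      # format chunk
    b'data', len(pcm))                            # data chunk header
blob = header + pcm

# Parse the bytes back with the standard library
with wave.open(io.BytesIO(blob), 'rb') as reader:
    params = reader.getparams()
    frames = reader.readframes(params.nframes)

print(params.nchannels, params.sampwidth, params.framerate, params.nframes)
```

Passing a filename such as `'out.wav'` to `wave.open` instead of the `BytesIO` object would run the same check against the file written above.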