# A BALANCED STEREO WIDENING NETWORK FOR HEADPHONES
Author: OLE KIRKEBY

- The network is balanced: there is a constraint on the sum of the magnitude responses of
    - the cross-talk paths (left input to right output, and vice versa)
    - the direct paths (left input to left output, and right input to right output)
    
- To ensure that only a minimum of **spectral colouration** is added to the original sound, only frequencies below 2 kHz are processed.

- Headphone reproduction potentially offers very good control of what the listener hears.

- Headphone listening can cause
    - in-head localization and
    - fatigue
- The cause is not fully understood, but many researchers agree on one theory: the auditory system struggles with inputs that do not contain the **interaural time and level differences** generated by physical sound sources.

    - Thus, a processing scheme that converts stereo material for headphone listening should insert some **binaural cues**.
    - However, certain types of spatial enhancement post-processing have a detrimental effect on the sound quality.
    - The advantages of a spatial enhancer must be weighed against the blurring and spectral colouration it adds to the source.
 
 
## 2. Filter design and implementation 

### 2.1 Stereo played back over loudspeakers
![four_paths_from_two_loudspeakers_to_listener](images/four_paths_from_two_loudspeakers_to_listener.png)
![four_paths_from_two_loudspeakers_to_listener_diagram](images/four_paths_from_two_loudspeakers_to_listener_diagram.png)

- As illustrated in Figure 2, the direct path and the cross-talk path each has 
    - a frequency-dependent gain, $G_d$ and $G_x$ respectively, and 
    - a frequency-dependent delay, $t$ and $t+itd$ respectively
        - The difference between the two delays is the **interaural time difference** $itd$.
        - Strictly speaking, $itd$ also depends on frequency, but **we will assume that it is constant**.
        - $itd$ is the most important cue for determining the location of a source in the horizontal plane. **For frontal sources $itd$ is zero, and it is largest for sources directly to the side of the listener.**
        
#### $itd$ value
The value of the interaural time difference $itd$ affects the amount of widening perceived by the listener. The highest value encountered when listening to real sound sources is around 0.7 ms (about 30 samples at 44.1 kHz). Too large a value of $itd$ (> 1 ms) may result in a very unnatural sound.
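The conversion from milliseconds to samples quoted above is simple arithmetic. The helper below is a minimal sketch; the name `itd_in_samples` is hypothetical, and rounding to the nearest whole sample is an assumption made here for illustration.

```python
def itd_in_samples(itd_seconds, sample_rate_hz):
    """Convert an interaural time difference to a whole number of samples."""
    # Rounding to the nearest integer is an assumption of this sketch.
    return round(itd_seconds * sample_rate_hz)


natural_max = itd_in_samples(0.7e-3, 44100)  # largest value for real sources
too_large = itd_in_samples(1.0e-3, 44100)    # the "unnatural" threshold
print(natural_max, too_large)
```

At 44.1 kHz, 0.7 ms is 30.87 samples, which rounds to 31 and agrees with the "about 30 samples" figure in the text.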
        
- Given the size of the human head, we can **set $G_d$ and $G_x$ to one at frequencies below 1 kHz**, and **$G_d$ to two and $G_x$ to zero at high frequencies**. Thus, if neither $G_d$ nor $G_x$ varies too rapidly in the transition band, the sum of $G_d$ and $G_x$ is always very close to two.
    - When an object, such as the human head, is positioned in an incident sound field, such as that produced by a loudspeaker, the sound field is not disturbed as long as the wavelength is long compared to the size of the object.
    - At high frequencies, where the wavelength is short compared to the size of the object, there is a pressure build-up on the object's near side, and a pressure attenuation on the object's far side.
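
The two limits described above can be pictured with a toy pair of gain curves. This is a sketch under stated assumptions, not the response measured in the paper: the function name `head_shadow_gains`, the raised-cosine transition, and the 4 kHz upper edge `f_high` are all invented for illustration. Only the limiting values (one and one below 1 kHz, two and zero at high frequencies) come from the text. In this sketch $G_d$ is defined as $2 - G_x$, so the sum is exactly two by construction.

```python
import numpy as np


def head_shadow_gains(freqs_hz, f_low=1000.0, f_high=4000.0):
    """Return illustrative (G_d, G_x) magnitudes for the given frequencies."""
    freqs_hz = np.asarray(freqs_hz, dtype=float)
    # Position inside the transition band on a log-frequency axis, in [0, 1].
    # The tiny floor avoids log(0) at DC.
    pos = np.log(np.maximum(freqs_hz, 1e-9) / f_low) / np.log(f_high / f_low)
    pos = np.clip(pos, 0.0, 1.0)
    # Cross-talk gain falls smoothly from 1 to 0 (assumed raised-cosine shape).
    g_x = 0.5 * (1.0 + np.cos(np.pi * pos))
    # Direct gain rises from 1 to 2 so that the sum stays at two.
    g_d = 2.0 - g_x
    return g_d, g_x


freqs = np.array([100.0, 1000.0, 2000.0, 4000.0, 10000.0])
g_d, g_x = head_shadow_gains(freqs)
for f, d, x in zip(freqs, g_d, g_x):
    print(f"{f:8.0f} Hz  G_d={d:.2f}  G_x={x:.2f}  sum={d + x:.2f}")
```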
    
    
    

### 2.2 Binaural synthesis
Binaural synthesis emulates, over headphones, the binaural sensation of listening to loudspeakers. An **attempt to model natural listening with very good accuracy introduces noticeable spectral colouration of the reproduced sound, particularly at frequencies above 3 kHz**, and this colouration is unacceptable for high-fidelity music material.

### 2.3 A balanced stereo widening network

The network is balanced because, for a signal fed to one input, the amplitudes of the two outputs sum to the amplitude of the input.
![magnitude_response_of_direct_cross](images/magnitude_response_of_direct_cross.png)

**In order to minimise processing artifacts**, in particular comb-filtering of the monophonic component at high frequencies, it is advantageous to **make the low-pass characteristic of $G_x$ more dramatic than the effect it emulates in real life**. Consequently, frequencies above 2kHz are considered *high*. 
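
The comb-filtering argument can be checked numerically. When both inputs carry the same (monophonic) signal, each ear receives the direct path plus a delayed cross-talk path, so the response is $G_d + G_x e^{-j 2\pi f \cdot itd}$. The sketch below evaluates its magnitude at a frequency where the delayed path arrives in anti-phase. The low-frequency cross-talk gain `X_GAIN = 0.3`, the Butterworth-like roll-off, and its order are assumptions chosen for illustration, and the gains are treated as zero-phase. Only the 0.7 ms delay and the 2 kHz limit come from the text.

```python
import numpy as np

ITD_S = 0.7e-3       # interaural time difference used in the text
X_GAIN = 0.3         # assumed low-frequency cross-talk gain (illustrative)
CUTOFF_HZ = 2000.0   # frequencies above this are treated as "high"


def mono_magnitude(freq_hz, lowpass_cross_talk, order=4):
    """Magnitude at one ear when left and right inputs are identical."""
    g_x = X_GAIN
    if lowpass_cross_talk:
        # Assumed Butterworth-like magnitude roll-off above the cut-off.
        g_x = X_GAIN / np.sqrt(1.0 + (freq_hz / CUTOFF_HZ) ** (2 * order))
    g_d = 1.0 - g_x  # balance constraint: g_d + g_x = 1
    delayed_cross = g_x * np.exp(-2j * np.pi * freq_hz * ITD_S)
    return abs(g_d + delayed_cross)


# At f = 2.5 / itd the cross-talk path is delayed by 2.5 periods (anti-phase).
notch_hz = 2.5 / ITD_S
full_band = mono_magnitude(notch_hz, lowpass_cross_talk=False)
low_passed = mono_magnitude(notch_hz, lowpass_cross_talk=True)
print(f"{notch_hz:.0f} Hz: full-band {full_band:.2f}, low-passed {low_passed:.2f}")
```

With full-band cross-talk the mono component drops to about 0.4 of its original level at this frequency. With the cross-talk rolled off above 2 kHz it stays close to one, which is the motivation for the steep low-pass characteristic.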

-----------------
## IMPLEMENTATION

In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
from IPython.display import Audio, display
from scipy.io import wavfile

In [None]:
"""
Pseudocode
- Cross-talk to a stereo file.
- lowpass 
- allpass group delay
- simple delay for $itd$
"""
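
The steps listed in the pseudocode cell could be sketched as follows. This is a hedged illustration rather than the network from the paper. The function name `widen`, the cross-talk gain `x_gain`, and the filter shape are assumptions. The filtering is done in the frequency domain with zero-phase gains. The all-pass group-delay stage is not implemented; a pure delay of `itd_s` seconds on the cross-talk path stands in for it. Because the FFT implements circular convolution, the delayed cross-talk wraps around at the end of the buffer, which is acceptable for a demonstration but not for production use.

```python
import numpy as np


def widen(stereo, fs, itd_s=0.7e-3, x_gain=0.3, cutoff_hz=2000.0, order=4):
    """Apply a toy balanced cross-talk network to an (n_samples, 2) array."""
    x = np.asarray(stereo, dtype=float)
    n = x.shape[0]
    spectrum = np.fft.rfft(x, axis=0)          # one spectrum per channel
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)

    # Low-passed cross-talk gain (assumed Butterworth-like magnitude).
    g_x = x_gain / np.sqrt(1.0 + (freqs / cutoff_hz) ** (2 * order))
    # Balanced direct gain: the two magnitudes sum to one at every frequency.
    g_d = 1.0 - g_x
    # Pure delay of itd_s seconds applied to the cross-talk path only.
    delay = np.exp(-2j * np.pi * freqs * itd_s)

    direct = g_d[:, None] * spectrum
    # Reversing the channel axis feeds left into right and right into left.
    cross = (g_x * delay)[:, None] * spectrum[:, ::-1]
    return np.fft.irfft(direct + cross, n=n, axis=0)


# Usage: a left-only test tone leaks into the right channel only when it is low.
fs = 44100
t = np.arange(4410) / fs
silence = np.zeros_like(t)
low = np.stack([np.sin(2 * np.pi * 500 * t), silence], axis=1)
high = np.stack([np.sin(2 * np.pi * 8000 * t), silence], axis=1)
out_low = widen(low, fs)
out_high = widen(high, fs)
print(np.max(np.abs(out_low[:, 1])), np.max(np.abs(out_high[:, 1])))
```

To use this on the `int16` samples loaded later in the notebook, the array would first need to be converted to floating point, and the result rescaled and cast back before writing it with `wavfile.write`.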



In [2]:
source = wavfile.read('data/tonight.wav')

In [3]:
source

(44100, array([[   0,    0],
        [   0,    0],
        [   0,    0],
        ...,
        [1099, 1096],
        [ 283,  207],
        [-695, -761]], dtype=int16))

In [5]:
source[1][:,0]

array([   0,    0,    0, ..., 1099,  283, -695], dtype=int16)

In [13]:
np.expand_dims(source[1][:,0], axis=1)

array([[   0],
       [   0],
       [   0],
       ...,
       [1099],
       [ 283],
       [-695]], dtype=int16)

In [17]:
source[1].shape

(7823928, 2)

In [15]:
np.concatenate((
    np.expand_dims(source[1][:,0], axis=1), np.expand_dims(source[1][:,0], axis=1)
), axis=1).shape

(7823928, 2)

In [46]:
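import os

def stereo_to_mono(filepath):
    """Convert a stereo wav file to "stereo-mono".

    The left channel is copied to both output channels, the result is
    written next to the input file, and both versions are played back.
    """
    fs, samples = wavfile.read(filepath)
    # Keep the left channel as a column so it can be stacked side by side.
    samples_mono = np.expand_dims(samples[:, 0], axis=1)
    samples_stereo_mono = np.concatenate(
        (samples_mono, samples_mono),
        axis=1
    )
    # Build "<name>_stereo_mono.wav" without assuming a 4-character extension.
    new_filepath = os.path.splitext(filepath)[0] + "_stereo_mono.wav"
    wavfile.write(new_filepath, rate=fs, data=samples_stereo_mono)
    # Audio expects (channels, samples); preview the first 20 seconds only.
    print("Stereo")
    display(Audio(data=samples[:fs*20, :].T, rate=fs))
    print("Stereo-mono")
    display(Audio(samples_stereo_mono[:fs*20, :].T, rate=fs))
    return samples_stereo_mono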

stereo_to_mono('data/tonight.wav')

Stereo


Stereo-mono


array([[   0,    0],
       [   0,    0],
       [   0,    0],
       ...,
       [1099, 1099],
       [ 283,  283],
       [-695, -695]], dtype=int16)
