# REAL TIME PITCH SHIFTING 

## PART 1: INTRODUCTION

**This notebook presents several techniques for pitch shifting in real time.
Concretely, the goal is to transform a person's voice in real time by making it deeper.**

**In particular, this notebook explores the robot voice technique, the basic granular synthesis algorithm, and finally a more advanced version of the latter that uses LPC.**

**Of course, pure real-time (i.e. sample-by-sample) processing is not achievable in practice. The data will instead be processed in buffers that are as small as possible.**

The main problem induced by the real-time approach is that the processing has to be efficient enough to avoid a noticeable delay between the input and output signals. To meet this constraint, a few things must be done:
    1. use integer values as much as possible.
    2. use precomputed look-up tables (LUTs) to avoid needlessly repeated computations.
    3. code as in C (close to the machine).
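
To illustrate the first two points, here is a minimal sketch of an integer-only multiplication against a precomputed table. It assumes a Q15 fixed-point convention (16-bit samples, coefficients scaled by $2^{15}-1$). The names `q15_mul`, `SINE_Q15` and `TABLE_LEN` are hypothetical and are not used elsewhere in this notebook.

```python
import numpy as np

Q15_ONE = 2**15 - 1  # largest positive int16 value, stands for ~1.0

# Precomputed table (LUT): one period of a sine stored as integers.
TABLE_LEN = 32  # assumed size, chosen only for this example
SINE_Q15 = np.round(
    np.sin(2 * np.pi * np.arange(TABLE_LEN) / TABLE_LEN) * Q15_ONE
).astype(np.int16)


def q15_mul(sample, coeff):
    """Multiply an int16 sample by a Q15 coefficient using integers only."""
    # The product needs 32 bits; shifting right by 15 brings it back to 16.
    return (int(sample) * int(coeff)) >> 15


print(q15_mul(16384, 16384))  # 0.5 * 0.5 in Q15 -> 8192, i.e. 0.25
```

The shift replaces a floating-point division, which is the kind of operation that stays cheap on a small embedded target.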

We will need the following libraries to handle all the audio processing to come:

In [1]:
import numpy as np
import sounddevice as sd
from scipy.io import wavfile
from matplotlib import pyplot as plt
import scipy.signal as sp

## PART 2: ROBOT VOICE

### What is a SIN table ?

In [2]:
#### text about sin table, explain why this is a constant

The following function precomputes the SIN table. 

In [3]:
# define necessary utility functions
def build_sine_table(f_sine, samp_freq, data_type):
    
    
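    # compute the integer conversion parameters
    if data_type == np.int16:
        MAX_SINE = 2**(15)-1
    elif data_type == np.int32:
        MAX_SINE = 2**(31)-1
    else:
        # any other sample type would leave MAX_SINE undefined
        raise ValueError("data_type must be np.int16 or np.int32")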
    
    # periods
    samp_per = 1./samp_freq
    sine_per = 1./f_sine

    # compute the right number of (integer) time instances
    LOOKUP_SIZE = len(np.arange(0, sine_per, samp_per))
    n = np.arange(LOOKUP_SIZE)
    
    
    freq_step = f_sine/samp_freq
    SINE_TABLE = np.sin(2*np.pi*n*freq_step) * MAX_SINE

    return SINE_TABLE, MAX_SINE, LOOKUP_SIZE
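
As a sanity check on the idea, the sketch below builds a one-period table and reads it with a wrapping pointer, then compares the result with a sine evaluated directly at every sample. The sampling rate and frequency are assumed values, chosen so that one period holds an exact integer number of samples (16000 / 500 = 32).

```python
import numpy as np

samp_freq = 16000   # assumed sampling rate (Hz)
f_sine = 500        # assumed sine frequency (Hz)

# One period of the sine, computed once.
lookup_size = samp_freq // f_sine
table = np.sin(2 * np.pi * np.arange(lookup_size) * f_sine / samp_freq)

# Read 100 samples by cycling through the table with a wrapping pointer.
pointer = 0
from_table = np.zeros(100)
for n in range(100):
    from_table[n] = table[pointer]
    pointer = (pointer + 1) % lookup_size

# Reference: evaluate the sine directly at every sample.
direct = np.sin(2 * np.pi * f_sine * np.arange(100) / samp_freq)
print(np.allclose(from_table, direct))
```

When the sampling rate is not a multiple of the sine frequency, the table length has to be rounded to an integer, so the frequency actually produced is the sampling rate divided by the table length rather than exactly the requested one.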

In [4]:
#### direct component explanation

In [5]:
####explain idea of modulation

In [6]:
#### explanation about the c board

**The init function provides all the state variables and creates the SIN table.**

In [7]:
# state variables
def init(f_sine, samp_freq):
    global sine_pointer
    global x_prev
    global GAIN
    global SINE_TABLE
    global MAX_SINE
    global LOOKUP_SIZE

    GAIN = 1
    x_prev = 0
    sine_pointer = 0
    
    # compute SINE TABLE
    # (data_type is not a parameter: it must already be defined at notebook level)
    SINE_TABLE, MAX_SINE, LOOKUP_SIZE  = build_sine_table(f_sine, samp_freq, data_type)

**The process function takes the input buffer (raw voice) and fills the output buffer with the pitch-shifted voice.**

In [8]:
def process(input_buffer, output_buffer, buffer_len):

    # specify global variables modified here
    global x_prev
    global sine_pointer

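    for n in range(buffer_len):

        # work on a Python int so the subtraction cannot wrap around in int16
        x = int(input_buffer[n])

        # high pass filter (first difference)
        y = x - x_prev

        # modulation by the tabulated sine, rescaled to [-1, 1]
        y = y * SINE_TABLE[sine_pointer]/MAX_SINE

        # saturate to the sample range before storing
        output_buffer[n] = min(max(y, -MAX_SINE - 1), MAX_SINE)

        # update state variables
        sine_pointer = (sine_pointer+1)%LOOKUP_SIZE
        x_prev = x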
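
To see what the modulation step does to the spectrum, the sketch below multiplies a pure tone by a sine and looks for the two strongest FFT bins. It is a vectorized illustration with assumed values (`f_tone` stands in for the voice), not the sample-by-sample loop above, and it leaves out the high-pass filter.

```python
import numpy as np

samp_freq = 16000          # assumed sampling rate (Hz)
f_tone = 1000              # assumed input tone (Hz), stands in for the voice
f_mod = 350                # modulation frequency (Hz)
n = np.arange(samp_freq)   # one second of samples

x = np.cos(2 * np.pi * f_tone * n / samp_freq)
carrier = np.sin(2 * np.pi * f_mod * n / samp_freq)
y = x * carrier

# With one second of signal, FFT bin k corresponds to k Hz.
spectrum = np.abs(np.fft.rfft(y))
peaks = np.sort(np.argsort(spectrum)[-2:])
print(peaks)  # bins 650 and 1350, i.e. f_tone - f_mod and f_tone + f_mod
```

The original tone disappears and is replaced by two copies shifted down and up by the modulation frequency, which is what gives the voice its metallic, "robotic" character.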

**We can use these functions either in real time or to process a WAV file. Here is the main script for a WAV file:**

In [None]:
"""
You can tweak the following parameters to play with the function
"""
buffer_len = 256
modulation_freq = 350
input_wav = "speech.wav"


samp_freq, signal = wavfile.read(input_wav)

# If the wav file has several channels, just pick one of them
if len(signal.shape)>1 :
    signal = signal[:,0]
    
n_buffers = len(signal)//buffer_len
data_type = signal.dtype

# allocate input and output buffers
input_buffer = np.zeros(buffer_len, dtype=data_type)
output_buffer = np.zeros(buffer_len, dtype=data_type)

"""
Nothing to touch after this!
"""

init(modulation_freq, samp_freq)
signal_proc = np.zeros(n_buffers*buffer_len, dtype=data_type)

for k in range(n_buffers):

    # index the appropriate samples
    input_buffer = signal[k*buffer_len:(k+1)*buffer_len]
    process(input_buffer, output_buffer, buffer_len)
    signal_proc[k*buffer_len:(k+1)*buffer_len] = output_buffer

# write to WAV
wavfile.write("speech_mod.wav", samp_freq, signal_proc)
print("Done !")

**Right below is the code you can use to transform your own voice in real time.**

In [10]:
# parameters
buffer_len = 256
modulation_freq = 500
data_type = np.int16
samp_freq = 44100

try:
    # the stream must run at the same rate the sine table is built for
    sd.default.samplerate = samp_freq
    sd.default.blocksize = buffer_len
    sd.default.dtype = data_type

    def callback(indata, outdata, frames, time, status):
        if status:
            print(status)
        process(indata[:,0], outdata[:,0], frames)

    init(modulation_freq, samp_freq)
    with sd.Stream(channels=1, callback=callback):
        print('#' * 80)
        print('press Return to quit')
        print('#' * 80)
        input()
except KeyboardInterrupt:
    # no argument parser exists in this notebook, so just report the interruption
    print('\nInterrupted by user')

################################################################################
press Return to quit
################################################################################



## PART 3: GRANULAR SYNTHESIS

### Main idea

With this method, the pitch shifting is not achieved by an *explicit* modulation like the one used for the robot voice.


Starting from the input signal, the goal is to **create and use new samples at a higher rate by interpolation**.

Concretely, those new samples will be separated by a period $T_s'$ that is smaller than the original period $T_s$.

Let's take an example. Suppose the pitch factor is 0.75, i.e. you want a deeper voice.

- The first block of data contains 10 samples at times $[0,1,2,3,4,5,6,7,8,9]$ ms, so $T_s = 1$ ms.

- The output contains the interpolated values of the input at times $0.75 \times [0,1,2,3,4,5,6,7,8,9]$ ms.
    - Note 1: The interpolation is **linear**. So $interpolatedValue(t=2.25) = 0.75 \times input(t=2) + 0.25 \times input(t=3)$.
    - Note 2: The last interpolation time is $t = 9 \times 0.75 = 6.75$ ms, so the input samples after that time are not used and there might be **losses of information**.
    - Note 3: The 10 output samples are played at the same rate $f_s$ at which the input samples were recorded. In other words, the information contained in the $[0, 6.75]$ ms interval originally took $6.75\,T_s = 6.75$ ms to play; in the output it now takes $9\,T_s = 9$ ms.


With this example, we can see that **the audio has been stretched in time by a factor $1/0.75 \approx 1.33$** (equivalently, every frequency is multiplied by 0.75), making the output sound deeper than the original.
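
The example above can be sketched in a few lines. The function name `resample_block` and the sample values are hypothetical; the sketch assumes a shift factor of at most 1, so that every read position stays inside the block.

```python
import numpy as np


def resample_block(block, shift_factor):
    """Linearly interpolate `block` at times k * shift_factor (in samples)."""
    n = len(block)
    times = np.arange(n) * shift_factor      # fractional read positions
    left = np.floor(times).astype(int)       # sample just before each time
    right = np.minimum(left + 1, n - 1)      # sample just after (clamped)
    frac = times - left                      # distance from the left sample
    return (1 - frac) * block[left] + frac * block[right]


# Ten made-up input samples, one per millisecond.
block = np.array([0., 10., 20., 40., 30., 20., 10., 0., 5., 10.])
out = resample_block(block, 0.75)
print(out[3])  # value at t = 2.25: 0.75 * 20 + 0.25 * 40 = 25.0
```

The output has as many samples as the input, but the last read position is 6.75, so the input samples at times 7, 8 and 9 never contribute.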

### Problem

**Because of this loss of information, there might be discontinuities between output blocks.**


This is an annoying artifact since discontinuities in an audio file result in "tick" noises that alter the overall audio quality.

### Solution : use overlapping grains

The trick is to use overlapping blocks of samples called 'grains'. Be careful: here, overlapping does not mean that output samples are a mix of consecutive input blocks. Instead, it means that **some samples will be processed twice in a row**:

- The first time as the last samples of a grain
- The second time as the first samples of the next grain

As in the previous example, **interpolated values are computed for each grain at times $grainStart + k \times shiftFactor \times T_s$**, with $T_s$ the sampling period and $k$ an integer ranging from 0 to the number of samples in the grain minus one.

**Important:** Where two grains overlap, two "families" of interpolated values are computed (i.e. one for each grain). To avoid discontinuities between them, it is necessary to apply a tapering window to those overlapping zones.
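
A minimal sketch of such a crossfade follows. The window shape is an assumption (complementary linear ramps), as are the overlap length and the two constant "families" of values, which are placeholders for the interpolated samples of each grain.

```python
import numpy as np

overlap = 8  # assumed number of overlapping samples

# Complementary linear ramps: at every position the two gains sum to 1.
fade_out = np.linspace(1.0, 0.0, overlap)   # applied to the end of grain 1
fade_in = 1.0 - fade_out                    # applied to the start of grain 2

# Two hypothetical sets of interpolated values covering the same overlap zone.
tail_of_grain_1 = np.full(overlap, 100.0)
head_of_grain_2 = np.full(overlap, 60.0)

blended = fade_out * tail_of_grain_1 + fade_in * head_of_grain_2
print(blended[0], blended[-1])  # starts at grain 1's value, ends at grain 2's
```

Because the two gains always sum to 1, the output moves gradually from one grain to the next instead of jumping, which removes the "tick" at the boundary.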

To summarize, this method works on chunks of data called "grains". The pitch shift is obtained not by an explicit modulation, as in the robot voice, but by resampling each grain, with the new samples computed from the raw input by linear interpolation.

- explain that the interpolation times stay the same (first look-up table, samp_vals)
- explain the second look-up table (amp_vals): multiplying coefficients
- IMAGE

- explain how grains work (what they contain) and why we cannot simply process grain by grain
- IMAGES

- need to use a window (why this window)
- explain the overlap zone, which covers the same samples but at different interpolation times. The down window is applied to the end of the first grain and the up window to the beginning of the second one to make a smooth transition between grains. The windows are applied to the interpolated values
- image

- explain the buffers that we use.


