# REAL TIME PITCH SHIFTING 

## PART 1: INTRODUCTION

**This notebook presents several techniques to perform pitch shifting in real time.
Concretely, the goal is to transform speech in real time so that it sounds deeper.**

**In particular this notebook will explore the robot voice technique, the basic granular synthesis algorithm and finally a more advanced version of the latter that uses LPC.**

**Of course, pure real-time processing (i.e. bit by bit) is impossible to achieve. The data will therefore be processed in buffers that are as small as possible.**

The main problem induced by the real-time approach is that the processing has to be very efficient, so that there is no noticeable delay between the input and output signals. To meet this constraint, a few things must be done:

1. use integer values as much as possible;
2. use precomputed lookup tables (LUT) to avoid useless repetitive computations;
3. code as in C (close to the machine).

We will need the following libraries to handle all the audio processing to come:

In [21]:
from IPython.display import Image
from IPython.core.display import HTML 

import numpy as np
import IPython
import sounddevice as sd
from scipy.io import wavfile
from matplotlib import pyplot as plt
import scipy.signal as sp

In this notebook, we will use the following wav file :

In [23]:
samp_freq, signal = wavfile.read("speech.wav")

# If the wav file has several channels, just pick one of them
if len(signal.shape)>1 :
    signal = signal[:,0]

IPython.display.Audio(signal, rate=samp_freq)



## PART 2: ROBOT VOICE

In this part, the robot voice implementation is given in Python so that you can use it directly here. A C implementation of the process is also discussed, which is useful if you want to run the robot voice on a board with real microphones.

### What is a SIN table ?

In order to implement the robot voice effect, we only need to multiply the input signal by a sinusoid (i.e. perform an 'explicit' modulation). However, in real-time computing, most computers and microcontrollers cannot compute the sine of a given value as fast as we would like. Therefore, we use a lookup table: the unique samples of the sinusoid are precomputed before the processing starts and are then read from memory as needed.

**The following function precomputes the SIN table in python**

In [2]:
# define necessary utility functions
def build_sine_table(f_sine, samp_freq, data_type):
    
    
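    # compute the integer conversion parameters
    if data_type == np.int16:
        MAX_SINE = 2**(15)-1
    elif data_type == np.int32:
        MAX_SINE = 2**(31)-1
    else:
        # other sample types are not handled by this builder
        raise ValueError("data_type must be np.int16 or np.int32")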
    
    # periods
    samp_per = 1./samp_freq
    sine_per = 1./f_sine

    # compute the right number of (integer) time instances
    LOOKUP_SIZE = len(np.arange(0, sine_per, samp_per))
    n = np.arange(LOOKUP_SIZE)
    
    
    freq_step = f_sine/samp_freq
    SINE_TABLE = np.sin(2*np.pi*n*freq_step) * MAX_SINE

    return SINE_TABLE, MAX_SINE, LOOKUP_SIZE

**The following function precomputes the SIN table in C**

### Explanation of the process

Below you can see how the output is processed so that a robot-like voice is obtained.

If you wonder why output_buffer, and not input_buffer, is multiplied by the SIN table, the answer is in the next part, which deals with DC noise removal.

1. sine_pointer is used as the index of the SINE_TABLE.
2. MAX_SINE is the maximum value of SINE_TABLE.

Integer variables are used in order to save processing time. However, a sine takes values in the interval [-1,1], which would normally call for fractional (float) values such as 0.1, 0.2, etc.

Therefore, in order to maximize the precision while minimizing the computation cost, the full range of the int16 type (i.e. 65535 values in this case) is used: the SINE_TABLE is first multiplied by MAX_SINE in the builder. Then, during the modulation, the product of the sample and the SINE_TABLE value is divided by MAX_SINE, so that the computation uses integer arithmetic without losing precision.
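
The scaling trick described above can be sketched as follows. This is only a minimal illustration under assumptions of ours: the names `sine_table_int` and `modulate_int16`, the 400 Hz / 16 kHz parameters and the rounding of the table to int16 are not part of the notebook's implementation.

```python
import numpy as np

MAX_SINE = 2**15 - 1  # largest int16 value, as in build_sine_table

# Assumed parameters for the illustration
samp_freq = 16000
f_sine = 400
lookup_size = samp_freq // f_sine  # 40 samples, exactly one period here

# Integer sine table with values in [-MAX_SINE, MAX_SINE]
n = np.arange(lookup_size)
sine_table_int = np.round(
    np.sin(2 * np.pi * n * f_sine / samp_freq) * MAX_SINE
).astype(np.int16)


def modulate_int16(x, table, start=0):
    """Multiply int16 samples by the integer table, then rescale by MAX_SINE."""
    y = np.empty_like(x)
    pointer = start
    for i in range(len(x)):
        # widen to Python ints so that the product is not limited to 16 bits
        product = int(x[i]) * int(table[pointer])
        # floor division here; C integer division truncates toward zero instead
        y[i] = product // MAX_SINE
        pointer = (pointer + 1) % len(table)
    return y


x = np.full(80, 1000, dtype=np.int16)  # constant test input
y = modulate_int16(x, sine_table_int)
print(y[:11])
```

The output stays within one unit of what a float multiplication by the sine would give, while only integer operations are used inside the loop.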

### DC noise removal

During the audio capture, the internal circuitry of the microphone may add an offset. Different microphones may also have different offsets, which can be really problematic.

This shift is typically called the waveform DC noise. For the robot voice effect, a DC noise would result in a constant sinusoid being present in the output signal. To see why, consider the signal that is actually captured:

$x'[n] = (x[n] + n_{DC}) $

So you obtain via processing:

$x'[n] * sin (w_{mod} * n) $

$= (x[n] + n_{DC}) * sin (w_{mod} * n) $

$= y[n] + n_{DC} * sin (w_{mod} * n)$

Where y[n] are the samples of our desired robot voice effect and n_DC is the level of the DC noise.

To remove the DC noise (which we assume to be constant over time), a simple high-pass filter is implemented. This filter removes any DC component of the signal, i.e. bin number 0 of the Discrete Fourier Transform (DFT). To stay efficient, a cheap high-pass filter of the following form is used:

$$y[n] = x[n] - x[n-1]$$

One drawback of this approach is that, although the DC noise is removed, the filter also attenuates frequencies in the range of interest.
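
Both effects can be checked numerically. The sketch below is an illustration only: the 16 kHz sampling rate, the 200 Hz test tone and the offset of 500 are assumed values. It applies the difference filter to a tone with a constant offset and evaluates the magnitude response $|H(e^{j\omega})| = 2|\sin(\omega/2)|$ of $H(z) = 1 - z^{-1}$ at a few frequencies.

```python
import numpy as np

fs = 16000                      # assumed sampling rate for the illustration
t = np.arange(1024) / fs
dc_offset = 500.0               # hypothetical constant microphone offset
x = 1000.0 * np.sin(2 * np.pi * 200 * t) + dc_offset

# y[n] = x[n] - x[n-1], with x[-1] taken as 0 (same initial state as x_prev = 0)
y = np.diff(x, prepend=0.0)

# after the first sample, the constant offset cancels out in every difference
print("mean of input :", np.mean(x))
print("mean of output:", np.mean(y[1:]))

# magnitude response of H(z) = 1 - z^-1: zero at DC, 2 at the Nyquist frequency
freqs = np.array([0.0, 200.0, 1000.0, fs / 2])
gain = 2 * np.abs(np.sin(np.pi * freqs / fs))
for f, g in zip(freqs, gain):
    print(f"{f:7.0f} Hz -> gain {g:.3f}")
```

The gain grows slowly from zero, so low speech frequencies are strongly attenuated along with the DC component. This is the price paid for such a cheap filter.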

### Benchmarking (in C)

The benchmarking uses a timer to measure the real-time processing speed.

Timers always have an input clock derived from one of the timebases of the microcontroller's internal clocks (a quartz for example). This timebase can either be used directly or be reduced by a factor called a [prescaler](https://en.wikipedia.org/wiki/Prescaler). It is important to choose an appropriate prescaler value, as it defines how fast the timer counts. For this application, the prescaler value is set to 48 to obtain a 1[μs] period for the internal clock, i.e. the timer has a 1[μs] resolution.

In this C implementation, the processing time is around 1800[μs], which is about 11% of the buffer time (16000[μs]).
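
The figures above can be reproduced with a few lines of arithmetic. Two values are assumptions on our side: a 48 MHz timer input clock (which is what a prescaler of 48 giving 1 μs ticks implies) and buffers of 256 samples at 16 kHz (which matches the 16000 μs buffer time).

```python
# Assumed: 48 MHz timer input clock, so that a prescaler of 48 gives 1 us ticks
timer_clock_hz = 48_000_000
prescaler = 48
tick_us = prescaler / timer_clock_hz * 1e6

# Assumed: 256-sample buffers at 16 kHz, matching the 16000 us buffer time
buffer_len = 256
samp_freq = 16000
buffer_time_us = buffer_len / samp_freq * 1e6

processing_time_us = 1800  # measured value quoted above
load = processing_time_us / buffer_time_us

print(f"timer tick  : {tick_us:.1f} us")
print(f"buffer time : {buffer_time_us:.0f} us")
print(f"load        : {load:.1%} of the available time")
```

As long as the load stays well below 100%, each buffer is fully processed before the next one arrives.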

### Hardware (in C)

For the hardware, an STM32 Nucleo-64 development board is used with the STM32F072RB microcontroller unit and the I2S MEMS Microphone Breakout by Adafruit. For the DAC (Digital-to-Analog Converter), Adafruit's I2S Stereo Decoder Breakout is used, which contains the DAC, an audio jack for connecting headphones, and the necessary additional components.

### Python code

Below you can find the python implementation of the robot voice.

**The init function provides all the state variables and creates the SIN table.**

In [3]:
# state variables
def init_robot(f_sine, samp_freq):
    global sine_pointer
    global x_prev
    global GAIN
    global SINE_TABLE
    global MAX_SINE
    global LOOKUP_SIZE

    GAIN = 1
    x_prev = 0
    sine_pointer = 0
    
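    # compute SINE TABLE
    # note: data_type is not an argument; it is read from the notebook's global scope
    # and must therefore be defined before init_robot is called
    SINE_TABLE, MAX_SINE, LOOKUP_SIZE  = build_sine_table(f_sine, samp_freq, data_type)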

**The process function takes the input buffer (raw voice) and fills the output buffer with the pitch-shifted voice.**

In [4]:
def process_robot(input_buffer, output_buffer, buffer_len):

    # specify global variables modified here
    global x_prev
    global sine_pointer

    for n in range(buffer_len):
        
        # high pass filter
        output_buffer[n] = input_buffer[n] - x_prev

        # modulation
        output_buffer[n] = output_buffer[n] * SINE_TABLE[sine_pointer]/MAX_SINE

        # update state variables
        sine_pointer = (sine_pointer+1)%LOOKUP_SIZE
        x_prev = input_buffer[n]

**We can use these functions either in real time or to process a wav file. Here is the main script for a wav file:**

In [19]:
"""
You can tweak the following parameters to play with the function
"""
buffer_len = 256
modulation_freq = 350
input_wav = "speech.wav"


samp_freq, signal = wavfile.read(input_wav)

# If the wav file has several channels, just pick one of them
if len(signal.shape)>1 :
    signal = signal[:,0]
    
n_buffers = len(signal)//buffer_len
data_type = signal.dtype

# allocate input and output buffers
input_buffer = np.zeros(buffer_len, dtype=data_type)
output_buffer = np.zeros(buffer_len, dtype=data_type)

"""
Nothing to touch after this!
"""

init_robot(modulation_freq, samp_freq)
signal_proc = np.zeros(n_buffers*buffer_len, dtype=data_type)

for k in range(n_buffers):

    # index the appropriate samples
    input_buffer = signal[k*buffer_len:(k+1)*buffer_len]
    process_robot(input_buffer, output_buffer, buffer_len)
    signal_proc[k*buffer_len:(k+1)*buffer_len] = output_buffer

# write to WAV
wavfile.write("result_robot.wav", samp_freq, signal_proc)
print("Done !")

IPython.display.Audio(signal_proc, rate=samp_freq)



Done !


**Right below is the code you can use to transform your own voice in real time.**

In [6]:
# parameters
buffer_len = 256
modulation_freq = 500
data_type = np.int16
samp_freq = 16000  # must match the stream sample rate, since the sine table depends on it

try:
    sd.default.samplerate = samp_freq
    sd.default.blocksize = buffer_len
    sd.default.dtype = data_type

    def callback(indata, outdata, frames, time, status):
        if status:
            print(status)
        process_robot(indata[:,0], outdata[:,0], frames)

    init_robot(modulation_freq, samp_freq)
    with sd.Stream(channels=1, callback=callback):
        print('#' * 80)
        print('press Return to quit')
        print('#' * 80)
        input()
except KeyboardInterrupt:
    # no argument parser is defined in this notebook, so simply report the interruption
    print('\nInterrupted by user')

################################################################################
press Return to quit
################################################################################



## PART 3: GRANULAR SYNTHESIS

### Main idea : the resampling technique

With this method, the pitch shifting is not achieved by an *explicit* modulation like the one used for the robot voice.


Based on the input signal, the goal is to **create new samples at a higher rate through interpolation**.

Concretely, those new samples will be separated by a period $T_s'$ that is smaller than the original period $T_s$.

Let's take an example. Suppose the pitch factor is 0.75, ie you want to have a deeper voice.

- The first block of data contains 10 samples at times $[0,1,2,3,4,5,6,7,8,9]$  $ ms$

- The output will contain interpolated values of the input at times $0.75 \times [0,1,2,3,4,5,6,7,8,9]$  $ ms$

**We must note 3 things here**:


- Note 1: The interpolation is **linear**. So, $interpolatedValue(t=2.25)$ = $0.75 \times input(t=2) + 0.25 \times input(t=3)$ 


- Note 2: The last interpolation time is $t = 9 \times 0.75 = 6.75$. So there might be **losses of information** (raw samples at times $t = [8,9]$ are not used !).


- Note 3: The 10 output samples will be played at the **same rate $f_s$** as the 10 input samples.

$\to$ It took initially $6.75$ $ms$ to record the information embedded in the $[0,6.75]$ interval. But after the transformation, it takes $9$  $ ms$ to play the same (resampled) information.
    
    
$\to$ **the audio has been _stretched_ in time by a factor $1/0.75 \approx 1.33$**, making the output sound deeper than the original.
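
The example above can be reproduced in a few lines. This is a minimal sketch: the helper name `resample_block` and the sample values are ours, and `np.interp` stands in for the explicit linear interpolation.

```python
import numpy as np


def resample_block(block, shift_factor):
    """Linearly interpolate `block` at times shift_factor * [0, 1, ..., len(block)-1]."""
    n = np.arange(len(block))
    return np.interp(shift_factor * n, n, block)


# 10 input samples taken at t = 0, 1, ..., 9 ms (arbitrary values for the illustration)
block = np.array([0., 10., 20., 30., 40., 50., 60., 70., 80., 90.])
out = resample_block(block, 0.75)

# out[3] is the value at t = 2.25 ms: 0.75 * block[2] + 0.25 * block[3]
print(out[3], 0.75 * block[2] + 0.25 * block[3])

# the last interpolation time is 6.75 ms, so block[8] and block[9] are never used
print(out[-1])
```

The output has as many samples as the input, but they only cover the first 6.75 ms of the block.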

### Problem

**Because of the loss of information, there might be discontinuities between output blocks.**


This is an annoying artifact since discontinuities in an audio file result in "tick" noises that alter the overall audio quality.

### Solution : use overlapping grains

The trick is to use overlapping chunks of samples that are called $grains$. 

Be careful : 'overlapping' does not mean that output samples are a mix between consecutive input blocks.

Instead, it means that **some raw samples will be processed twice in a row.**

- The first time as the last samples of a grain
- The second time as the first samples of the next grain

---------
As a general rule (and as in the previous example), interpolated values are computed for each grain at times $grainStart + k \times shiftFactor \times T_s$, where:

- $T_s$ is the sampling period
- $k$ is an integer ranging from 0 to the number of samples in the grain minus 1.


--------------------------


**Important :** 

For samples belonging to $2$ consecutive grains, $2$ different interpolated values are computed (i.e. one for each grain). Hence, it is necessary to find a way of **combining those two values** so that there is no abrupt discontinuity between output samples. This is achieved by using a **tapered window**.

A tapered window is simply an array that associates a multiplicative factor to each resampled (ie interpolated) value.
- For zones with no overlap, this factor is simply 1
- For zones with overlap, this factor is a number in $[0,1]$ that follows a linear function
    - This function is increasing for resampled values at the start of a grain
    - This function is decreasing for resampled values at the end of a grain


**Concretely :**
When there are two different resampled values for the same output sample:
- $resampled\_value_1$ is scaled by the tapered window of its grain evaluated at time $t=output\_sample\_time$
- $resampled\_value_2$ is scaled by the tapered window of its grain evaluated at time $t=output\_sample\_time$
- output sample = $scaled\_resampled\_value_1$ + $scaled\_resampled\_value_2$
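
The merging rule can be illustrated with two short overlapping pieces. All numbers below are hypothetical: an overlap of 4 output samples, a first grain ending on the value 10 and a second grain starting on the value 20.

```python
import numpy as np

overlap = 4  # hypothetical number of overlapping output samples

# decreasing ramp for the end of the first grain, increasing ramp for the start of the second
fade_out = 1.0 - np.arange(overlap) / (overlap - 1)   # 1, 2/3, 1/3, 0
fade_in = 1.0 - fade_out                              # 0, 1/3, 2/3, 1

resampled_1 = np.full(overlap, 10.0)  # resampled values at the end of the first grain
resampled_2 = np.full(overlap, 20.0)  # resampled values at the start of the next grain

# each value is scaled by the window of its own grain, then the two are added
output = fade_out * resampled_1 + fade_in * resampled_2
print(output)
```

Because the two window factors always add up to 1, the output moves gradually from the first grain to the second instead of jumping from 10 to 20.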


--------------
As an image is worth a thousand words, let's have a look at the following example

In [7]:
Image(url= "https://i.postimg.cc/ZqLH95hQ/resampling-times.png")

On the figure above, each square represents 1 time unit ($TU$).

- The input samples are represented in dark blue. We can see that $T_s=6$  $TU$
- The pitch shifting factor is $\frac{11}{12}$ so this implies that $T_s'=5.5$ $TU$

For both grains, we can see that resampling times (pink and light blue ticks on the x-axis) are placed at integer multiples of $T_s'$ from the start of the grains.

We can see that the **raw** samples at times $t=[36, 42, 48, 54]$ $TU$ belong to two different grains and will thus be processed twice.
- The first time to compute $sample_{pink}(t= [38.5, 44, 49.5])$
- The second time to compute $sample_{blue}(t= [41.5, 47, 52.5])$
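
The resampling times of the figure can be recomputed as a quick check. The grain length (10 samples) and the start of the second grain (6 samples after the first one) are read off the figure and should be taken as assumptions.

```python
import numpy as np

T_s = 6.0                        # input sampling period in TU (from the figure)
shift_factor = 11 / 12
T_s_prime = shift_factor * T_s   # 5.5 TU

# Assumed from the figure: 10 samples per grain, second grain starting at t = 36 TU
grain_len = 10
grain_starts = {"pink": 0.0, "blue": 6 * T_s}

k = np.arange(grain_len)
times = {name: start + k * T_s_prime for name, start in grain_starts.items()}

print("pink:", times["pink"])       # includes 38.5, 44 and 49.5
print("blue:", times["blue"][:4])   # includes 41.5, 47 and 52.5
```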
--------------
**Note** :
- Here we only show the first two grains of the audio stream, so the first samples of $grain_1$ are exclusively used for $grain_1$ (since there is no $grain_0$ !)
- With the same idea, the last samples of the last grain of the audio stream will also be used exclusively for this last grain.

-------------

On the figure below, you can see all the resampled (interpolated) values computed for every resample time.

In [8]:
Image(url="https://i.postimg.cc/KYbQvgCL/resampled-values.png")

The last thing to do in this example is to scale and merge the interpolated values of $grain_1$ and $grain_2$ that overlap.

**Concretely** :
- $sample_{pink}(t=38.5)$ and $sample_{blue}(t=41.5)$ must be merged to compute $sample_{output}(t=42)$


- As $window_{pink}(t=42) = \frac{2}{3}$ and $window_{blue}(t=42) = \frac{1}{3}$


- $sample_{output}(t=42) = window_{pink}(t=42) \times sample_{pink}(t=38.5) + window_{blue}(t=42) \times sample_{blue}(t=41.5) = \frac{2}{3} \times sample_{pink}(t=38.5) + \frac{1}{3} \times sample_{blue}(t=41.5)$


--------------------
**Important note: do not forget that the sample blocks arrive one after the other! This implies that:**
- In order to compute the output samples corresponding to the last raw samples of the grain, it is necessary to wait for the next sample block. When the latter arrives, the second grain can be formed and the output samples can finally be computed, as all the overlapping samples (e.g. pink **and** light blue) are available.


- Each newly arrived sample block needs to have access to:
    1. The last **raw** samples of the preceding block $\to$ build the new grain
    2. The last resampled values (pink) of the preceding block $\to$ compute the output samples for the overlapping part at the start of this new grain.
    
    
**In our example**: every time a block arrives, only $6$ output samples can be directly computed. The remaining $3$ resampled values (at the end of the block) need to be combined with the first $3$ resampled values that will be computed in the next grain.

${\to}$ This value $6$ is referred to as the $stride$.


-----------------


### From theory to implementation

**Now that the algorithm has been explained, it is time to understand the implications of every constraint in the implementation**

###### Constant parameters

- Number of samples in a grain $\to$ $GRAIN\_LEN\_SAMP$
- Number of output samples that can be produced without the need of the next grain $\to$ $STRIDE$
- The tapering window of length $GRAIN\_LEN\_SAMP$ $\to$ $WIN$
    - In fact, the length of the mixing edges of the window can be modified (which obviously modifies the value of $STRIDE$)

Interpolation :
- Resampling times with respect to the start of the grain (pink and light-blue time indices) $\to$ $SAMP\_VALS$
- The amplitude factor associated to the preceding raw sample for each resampling time $\to$ $AMP\_VALS$ (necessary to perform linear interpolation)

    - When we have : $interpolatedValue(t=2.25)$ = $0.75 \times input(t=2)+0.25 \times input(t=3)$
    - &nbsp;&nbsp;  $SAMP\_VALS(t=2.25) = 2$
    - &nbsp;&nbsp;  $AMP\_VALS(t=2.25) = 0.75$
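
The way these two tables drive the interpolation can be sketched as follows. The helper names `build_tables` and `resample_with_tables` and the grain values are hypothetical, and the sketch assumes a shift factor below 1 so that the index of the following sample stays inside the grain.

```python
import numpy as np


def build_tables(n_samples, shift_factor):
    """Index of the preceding raw sample and its weight, for each resampling time."""
    t = np.arange(n_samples) * shift_factor
    samp_vals = np.floor(t).astype(int)
    amp_vals = 1.0 - (t - samp_vals)
    return samp_vals, amp_vals


def resample_with_tables(grain, samp_vals, amp_vals):
    # weighted sum of the preceding raw sample and of the following one
    return amp_vals * grain[samp_vals] + (1.0 - amp_vals) * grain[samp_vals + 1]


grain = np.array([3., 1., 4., 1., 5., 9., 2., 6., 5., 3.])  # arbitrary values
samp_vals, amp_vals = build_tables(len(grain), 0.75)
resampled = resample_with_tables(grain, samp_vals, amp_vals)

# entry 3 corresponds to t = 2.25: preceding sample 2, weight 0.75
print(samp_vals[3], amp_vals[3])
print(resampled)
```

Since the tables only depend on the grain length and on the shift factor, they can be computed once and reused for every grain.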


    
###### What arrives at each iteration
- A sample block of &nbsp;$STRIDE$&nbsp; samples


###### To be passed between each sample blocks
- The last raw samples of the previous block $\to$ $PREVIOUS\_RAW$
    - Note : those samples concatenated with the &nbsp;$STRIDE$ samples of the arriving block make a full grain
    
    
- The last fully processed resampled values (window scaling included) of the previous block $\to$ $PREVIOUS\_DOWN\_WINDOWED$ 

###### Locally used variables

- The grain formed by the incoming raw sample block concatenated with the $PREVIOUS\_RAW$ samples $\to$ $GRAIN$
- The array containing the resampled values and then the window-scaled resampled values $\to$ $RESAMPLED\_GRAIN$

### Python implementation

In [9]:
def ms2smp(ms, fs):
    """
    Parameters
    ----------
    ms: float
        Time in milliseconds
    fs: float
        Sampling rate in Hz.
    """
    # seconds = ms/1000
    return int(fs*ms/1000.)

In [10]:
def win_taper(N, a, data_type=np.int16):

    """
    Parameters
    ----------
    N: the length of the grain (in samples)
    a: a float between 0 and 1 representing the fraction of the N samples that will be attenuated
        (a fraction a/2 is attenuated on each side)
    data_type: the data type of the samples (currently unused: the window holds floats)
        
    output: a profile of floats in [0, 1] that attenuates the samples at the beginning
    and at the end of grains so that we can make grains overlap without problem."""
    
    
    # Number of samples that are attenuated on each side
    nb_attenuated = int(N * a / 2)
    
    # Create the increasing "ramp"
    ramp = np.arange(0, nb_attenuated) / float(nb_attenuated)
    
    # Create the final profile by concatenating increasing ramp, untouched samples and decreasing ramp
    win = np.concatenate((ramp, 
        np.ones(N-2*nb_attenuated), 
        ramp[::-1]))
    
    
    return win

def compute_stride(N, a):
    return N - int(N * a / 2) - 1
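
A useful property of this window and of this stride is that the end of one grain and the start of the next one always add up to 1 in the overlapping zone. The sketch below checks it; `taper` follows the same construction as `win_taper`, and the values `N = 320` and `a = 0.3` (e.g. 20 ms grains at 16 kHz) are assumptions chosen for the illustration.

```python
import numpy as np


def taper(N, a):
    """Same construction as win_taper: linear ramps around a flat part."""
    nb = int(N * a / 2)
    ramp = np.arange(nb) / float(nb)
    return np.concatenate((ramp, np.ones(N - 2 * nb), ramp[::-1]))


N, a = 320, 0.3                     # assumed values for the illustration
stride = N - int(N * a / 2) - 1     # same formula as compute_stride
overlap = N - stride

win = taper(N, a)

# the next grain starts `stride` samples later, so the last `overlap` window values
# of one grain are added to the first `overlap` window values of the next one
total = win[stride:] + win[:overlap]
print(overlap, np.allclose(total, 1.0))
```

This also explains the `- 1` in `compute_stride`: the increasing ramp starts at 0, so one extra sample of overlap is needed for the two ramps to line up.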

In [11]:
def build_linear_interp_table(n_samples, down_fact, data_type=np.int16):
    
    #previous existing sample
    which_samples = []
    #fractional amplitude.
    amplitudes = []
    for n in range(n_samples):
        
        # The interpolation time
        t = n*down_fact
        
        # The largest integer not greater than t
        N = np.floor(t)
        
        # The weight given to this preceding sample
        # (if t = 1.01 then the weight of the sample at time N=1 should be 0.99)
        a = 1-(t-N)
        
        which_samples.append(int(N))
        amplitudes.append(a)

    return which_samples, amplitudes

#### Specific functions to use the LPC coefficients feature (see this [gitbook](https://dsp-labs-v1.gitbook.io/project/v/gitbook_hosting/_intro-3) )

In [12]:
def bac(x, p):
    # compute the biased autocorrelation for x up to lag p
    L = len(x)
    r = np.zeros(p+1)
    for m in range(0, p+1):
        for n in range(0, L-m):
            r[m] += x[n] * x[n+m]
        r[m] /= float(L)
    return r

def ld(r, p):
    # solve the toeplitz system using the Levinson-Durbin algorithm
    g = r[1] / r[0]
    a = np.array([g])
    v = (1. - g * g) * r[0]
    for i in range(1, p):
        g = (r[i+1] - np.dot(a, r[1:i+1])) / v
        a = np.r_[ g,  a - g * a[i-1::-1] ]
        v *= 1. - g*g
    # return the coefficients of the A(z) filter
    return np.r_[1, -a[::-1]]

def lpc(x, p):
    # x is a GRAIN
    # p is a scalar P=20
    
    return ld(bac(x, p), p)
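
As a sanity check, the recursion can be compared with a direct solution of the normal equations on a signal whose model is known. This is only a sketch: `autocorr` and `levinson` are compact restatements of `bac` and `ld`, the AR(2) coefficients are arbitrary, and `scipy.linalg.solve_toeplitz` is used as the reference solver.

```python
import numpy as np
from scipy.linalg import solve_toeplitz
from scipy.signal import lfilter


def autocorr(x, p):
    """Biased autocorrelation up to lag p (vectorized equivalent of bac)."""
    L = len(x)
    return np.array([np.dot(x[:L - m], x[m:]) / L for m in range(p + 1)])


def levinson(r, p):
    """Levinson-Durbin recursion, same steps as ld."""
    g = r[1] / r[0]
    a = np.array([g])
    v = (1.0 - g * g) * r[0]
    for i in range(1, p):
        g = (r[i + 1] - np.dot(a, r[1:i + 1])) / v
        a = np.r_[g, a - g * a[i - 1::-1]]
        v *= 1.0 - g * g
    return np.r_[1, -a[::-1]]


# synthetic AR(2) signal: x[n] = 1.3 x[n-1] - 0.6 x[n-2] + e[n]
rng = np.random.default_rng(0)
true_A = np.array([1.0, -1.3, 0.6])
x = lfilter([1.0], true_A, rng.standard_normal(20000))

p = 4
r = autocorr(x, p)
A_ld = levinson(r, p)

# direct solution of the Toeplitz system R a = [r1, ..., rp]
A_direct = np.r_[1, -solve_toeplitz(r[:p], r[1:p + 1])]

print(np.round(A_ld, 3))
print(np.allclose(A_ld, A_direct))
```

The first three coefficients should be close to `true_A` and the extra ones close to zero, since the signal only has an order-2 structure.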

#### Init function

In [13]:
# state variables and constants
def init(GRAIN_LEN_SAMP, a, shift_factor, data_type):
    
    int16 = np.int16

    # lookup table for tapering window (FLOATS)
    global WIN
    
    # Number of coefficients for LPC
    global P 

    # lookup table for linear interpolation 
    global SAMP_VALS
    global AMP_VALS
    
    # To be passed between different iterations (arrays of INT16 elements)
    global PREVIOUS_RAW
    global PREVIOUS_DOWN_WINDOWED
    
    # To process each iteration (arrays of INT16 element)
    global GRAIN
    global RESAMPLED_GRAIN
    
    WIN = win_taper(GRAIN_LEN_SAMP, a, data_type)
    SAMP_VALS, AMP_VALS = build_linear_interp_table(GRAIN_LEN_SAMP, shift_factor, data_type)
    
    P = 20
    
    PREVIOUS_RAW = np.zeros(int(GRAIN_LEN_SAMP*a/2) + 1, dtype=int16)
    PREVIOUS_DOWN_WINDOWED = np.zeros(int(GRAIN_LEN_SAMP*a/2) + 1, dtype=int16)
    
    GRAIN = np.zeros(GRAIN_LEN_SAMP,dtype=int16)
    RESAMPLED_GRAIN = np.zeros(GRAIN_LEN_SAMP, dtype=int16)

#### Process function

In [14]:
def process(input_buffer, output_buffer, buffer_len, GRAIN_LEN_SAMP, OVERLAP_LEN, MAX_VAL,shift_factor, use_LPC=False):

    global PREVIOUS_RAW
    global PREVIOUS_DOWN_WINDOWED
    global GRAIN
    global SAMP_VALS
    global AMP_VALS
    global RESAMPLED_GRAIN
    global WIN
    global P

    # prepend the raw samples from the previous buffer: those samples are used a second time now
    # (GRAIN holds raw sample values here, not floats between -1 and 1)
    for n in range(GRAIN_LEN_SAMP):
        if n < OVERLAP_LEN:
            GRAIN[n] = PREVIOUS_RAW[n]
        else:
            GRAIN[n] = input_buffer[n - OVERLAP_LEN]
    
    
    # If we use LPC, the GRAIN variable will be modified after this for loop.
    # So before it gets modified, we must store the last samples of the current raw grain
    # so that they can be used a second time along with the next grain.
    # Note: STRIDE is read from the global scope and buffer_len is expected to equal STRIDE.
    for n in range(STRIDE, GRAIN_LEN_SAMP):
        PREVIOUS_RAW[n - buffer_len] = GRAIN[n]
        
        
    # LPC is skipped when the grain starts with a zero sample
    # (crude guard against silent grains, whose autocorrelation r[0] would be zero).
    # The decision is stored once, so that `a` is always defined when the inverse filter is applied.
    lpc_active = use_LPC and GRAIN[0] != 0
    if lpc_active:
        a = lpc(np.float32(GRAIN), P)  # compute the coefficients of A(z)
        GRAIN = sp.lfilter(a, [1.], GRAIN)  # the grain now contains the excitation signal

    # Use linear interpolation to resample
    for n in range(GRAIN_LEN_SAMP):
        coeff = AMP_VALS[n]
        RESAMPLED_GRAIN[n] = coeff * GRAIN[SAMP_VALS[n]] + (1-coeff) * GRAIN[SAMP_VALS[n]+1]

    if lpc_active:
        # Now that the excitation has been resampled, we can apply the inverse filter 1/A(z)
        # so that the overall energy envelope of the spectrum is not modified through resampling
        RESAMPLED_GRAIN = sp.lfilter([1], a, RESAMPLED_GRAIN)
        
    
    
    # apply window
    for n in range(GRAIN_LEN_SAMP):
        RESAMPLED_GRAIN[n] = RESAMPLED_GRAIN[n] * WIN[n]
    
    # Write in output buffer and update buffers for next grain
    for n in range(GRAIN_LEN_SAMP):
        
        # overlapping part
        if n < OVERLAP_LEN:
            output_buffer[n] = RESAMPLED_GRAIN[n] + PREVIOUS_DOWN_WINDOWED[n]
            
        # non-overlapping part
        elif n < STRIDE:
            output_buffer[n] = RESAMPLED_GRAIN[n]
            
        # update state variables for next iterations
        else:
            PREVIOUS_DOWN_WINDOWED[n - buffer_len] = RESAMPLED_GRAIN[n]
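
In the LPC branch, the grain is first filtered by $A(z)$ to obtain the excitation, and the resampled excitation is then filtered by $1/A(z)$. The sketch below shows that, without the resampling step in between, the two filters cancel exactly, which is why only the excitation ends up being pitch shifted. The coefficients in `A` and the random grain are hypothetical values.

```python
import numpy as np
from scipy.signal import lfilter

A = np.array([1.0, -1.3, 0.6])          # hypothetical LPC polynomial A(z)
rng = np.random.default_rng(1)
grain = rng.standard_normal(320)         # stand-in for a grain of raw samples

excitation = lfilter(A, [1.0], grain)        # analysis:  E(z) = A(z) X(z)
rebuilt = lfilter([1.0], A, excitation)      # synthesis: X(z) = E(z) / A(z)

print(np.allclose(rebuilt, grain))
```

In `process`, the interpolation is applied to `excitation` before the synthesis filter, so the spectral envelope described by $A(z)$ is reimposed on the shifted excitation.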

#### File processing cell without LPC

In [24]:
""" User selected parameters """
input_wav = "speech.wav"
N = 20      # in milliseconds
a = 0.3    # grain overlap (0,1)
shift_factor = 0.7  # < 1.0

""" Nothing to touch after this!"""

# open WAV file
samp_freq, signal = wavfile.read(input_wav)

if len(signal.shape)>1 :
    signal = signal[:,0] # get one channel
    
data_type = signal.dtype
MAX_VAL = np.iinfo(data_type).max

# derived parameters
GRAIN_LEN_SAMP = ms2smp(N, samp_freq)
STRIDE = compute_stride(GRAIN_LEN_SAMP, a)
OVERLAP_LEN = GRAIN_LEN_SAMP - STRIDE

# allocate input and output buffers
input_buffer = np.zeros(STRIDE, dtype=data_type)
output_buffer = np.zeros(STRIDE, dtype=data_type)


init(GRAIN_LEN_SAMP, a, shift_factor, data_type)
n_buffers = len(signal)//STRIDE
signal_proc = np.zeros(n_buffers*STRIDE, dtype=data_type)

with_LPC = False


for k in range(n_buffers):

    start_idx = k*STRIDE
    end_idx = (k+1)*STRIDE

    input_buffer = signal[start_idx:end_idx]
    process(input_buffer, output_buffer, STRIDE, GRAIN_LEN_SAMP, OVERLAP_LEN, MAX_VAL, shift_factor, use_LPC=with_LPC)
    signal_proc[start_idx:end_idx] = output_buffer

# write to WAV

if with_LPC:
    file_name = "result_GS_with_LPC.wav"
else:
    file_name = "result_GS_without_LPC.wav"
    
print("Result written to: %s" % file_name)
wavfile.write(file_name, samp_freq, signal_proc)

IPython.display.Audio(signal_proc, rate=samp_freq)



Result written to: result_GS_without_LPC.wav


#### File processing cell with LPC

In [None]:
""" User selected parameters """
input_wav = "speech.wav"
N = 20      # in milliseconds
a = 0.3    # grain overlap (0,1)
shift_factor = 0.7  # < 1.0

""" Nothing to touch after this!"""

# open WAV file
samp_freq, signal = wavfile.read(input_wav)

if len(signal.shape)>1 :
    signal = signal[:,0] # get one channel
    
data_type = signal.dtype
MAX_VAL = np.iinfo(data_type).max

# derived parameters
GRAIN_LEN_SAMP = ms2smp(N, samp_freq)
STRIDE = compute_stride(GRAIN_LEN_SAMP, a)
OVERLAP_LEN = GRAIN_LEN_SAMP - STRIDE

# allocate input and output buffers
input_buffer = np.zeros(STRIDE, dtype=data_type)
output_buffer = np.zeros(STRIDE, dtype=data_type)


init(GRAIN_LEN_SAMP, a, shift_factor, data_type)
n_buffers = len(signal)//STRIDE
signal_proc = np.zeros(n_buffers*STRIDE, dtype=data_type)

with_LPC = True


for k in range(n_buffers):

    start_idx = k*STRIDE
    end_idx = (k+1)*STRIDE

    input_buffer = signal[start_idx:end_idx]
    process(input_buffer, output_buffer, STRIDE, GRAIN_LEN_SAMP, OVERLAP_LEN, MAX_VAL, shift_factor, use_LPC=with_LPC)
    signal_proc[start_idx:end_idx] = output_buffer

# write to WAV

if with_LPC:
    file_name = "result_GS_with_LPC.wav"
else:
    file_name = "result_GS_without_LPC.wav"
    
print("Result written to: %s" % file_name)
wavfile.write(file_name, samp_freq, signal_proc)

IPython.display.Audio(signal_proc, rate=samp_freq)

#### Real Time processing cell

In [16]:
""" User selected parameters """
N = 30
a = 0.2

# CHANGE THIS VALUE TO HAVE MORE/LESS DEEP VOICE
shift_factor = 0.6

""" Nothing to touch after this!"""

data_type = np.int16
samp_freq = 16000

# derived parameters
MAX_VAL = np.iinfo(data_type).max
GRAIN_LEN_SAMP = ms2smp(N, samp_freq)

STRIDE = compute_stride(GRAIN_LEN_SAMP, a)
OVERLAP_LEN = GRAIN_LEN_SAMP-STRIDE

# allocate input and output buffers
input_buffer = np.zeros(STRIDE, dtype=data_type)
output_buffer = np.zeros(STRIDE, dtype=data_type)

with_LPC = True


try:
    sd.default.samplerate = 16000
    sd.default.blocksize = STRIDE
    sd.default.dtype = data_type
    print(data_type)

    def callback(indata, outdata, frames, time, status):
        if status:
            print(status)
        process(indata[:,0], outdata[:,0], frames, GRAIN_LEN_SAMP, OVERLAP_LEN, MAX_VAL,shift_factor, use_LPC=with_LPC)

    
    init(GRAIN_LEN_SAMP, a, shift_factor, data_type)
    
    with sd.Stream(channels=1, callback=callback):
        print('#' * 80)
        print('press Return to quit')
        print('#' * 80)
        input()
        
except KeyboardInterrupt:
    # no argument parser is defined in this notebook, so simply report the interruption
    print('\nInterrupted by user')

<class 'numpy.int16'>
################################################################################
press Return to quit
################################################################################
input overflow, output underflow

