# Noise Reduction

Reduce background music, noise, etc. while preserving voice activity.

<div class="alert alert-info">

This tutorial is available as an IPython notebook at [malaya-speech/example/noise-reduction](https://github.com/huseinzol05/malaya-speech/tree/master/example/noise-reduction).
    
</div>

<div class="alert alert-info">

This module is language independent, so it is safe to use on different languages. The pretrained models were trained on multiple languages.
    
</div>

<div class="alert alert-warning">

This is an application of the malaya-speech Pipeline; read more about the malaya-speech Pipeline at [malaya-speech/example/pipeline](https://github.com/huseinzol05/malaya-speech/tree/master/example/pipeline).
    
</div>

### Dataset

Trained on English, Manglish and Bahasa podcasts augmented with noise, gathered at https://github.com/huseinzol05/malaya-speech/tree/master/data/podcast
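
The page does not describe how the noise was mixed into the podcasts. One common way to build such training pairs (and to create your own test mixtures) is to scale a noise clip to a target signal-to-noise ratio and add it to clean speech. A minimal NumPy sketch, assuming simple additive mixing; `mix_at_snr` is a hypothetical helper, not part of malaya-speech:

```python
import numpy as np


def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Add `noise` to `speech`, scaled so the speech-to-noise ratio equals `snr_db`."""
    # Tile, then trim, the noise so it covers the whole speech signal.
    if len(noise) < len(speech):
        repeats = int(np.ceil(len(speech) / len(noise)))
        noise = np.tile(noise, repeats)
    noise = noise[: len(speech)]

    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    # SNR_dB = 10 * log10(P_speech / (g^2 * P_noise)), solved for the gain g.
    gain = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + gain * noise


sr = 44100
t = np.arange(sr) / sr
speech = 0.5 * np.sin(2 * np.pi * 220 * t)  # stand-in for one second of voice
noise = np.random.default_rng(0).normal(size=sr // 2)  # stand-in for background noise
mixture = mix_at_snr(speech, noise, snr_db=5.0)
```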

```python
import malaya_speech
import numpy as np
from malaya_speech import Pipeline
```





```python
y, sr = malaya_speech.load('output_44k.wav', sr = 44100)
len(y), sr, len(y) / sr
```

(27909120, 44100, 632.8598639455782)

So the total length is about 633 seconds (roughly 10.5 minutes).

```python
import IPython.display as ipd
ipd.Audio(y[:10 * sr], rate = sr)
```

This audio is extracted from https://www.youtube.com/watch?v=blaIfSWf38Q&t=25s&ab_channel=SkolarMalaysia

As you can hear, the audio has introduction music overlapping with the speakers. We want to reduce that introduction music and, if possible, split the audio into voice and background noise.

### List available deep models

```python
malaya_speech.noise_reduction.available_model()
```

| Model | Size (MB) | Quantized Size (MB) | SUM MAE | MAE_SPEAKER | MAE_NOISE | SDR | ISR | SAR |
|---|---|---|---|---|---|---|---|---|
| unet | 78.9 | 20.0 | 0.862316 | 0.460676 | 0.40164 | 9.17312 | 13.92435 | 13.20592 |
| resnet-unet | 96.4 | 24.6 | 0.82535 | 0.43885 | 0.38649 | 9.45413 | 13.9639 | 13.60276 |
| resnext-unet | 75.4 | 19.0 | 0.81102 | 0.44719 | 0.36383 | 8.992832 | 13.49194 | 13.1321 |
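
The `SUM MAE` column appears to be `MAE_SPEAKER + MAE_NOISE` (for `unet`, 0.460676 + 0.40164 = 0.862316; the other rows agree up to rounding). Lower MAE is better, while higher SDR, ISR and SAR are better, so the "best" model depends on the metric you care about. A small sketch that recomputes the sums from the table values and picks a model per metric:

```python
# Values copied from the table above.
metrics = {
    "unet": {"mae_speaker": 0.460676, "mae_noise": 0.40164, "sdr": 9.17312},
    "resnet-unet": {"mae_speaker": 0.43885, "mae_noise": 0.38649, "sdr": 9.45413},
    "resnext-unet": {"mae_speaker": 0.44719, "mae_noise": 0.36383, "sdr": 8.992832},
}

# Assumption: SUM MAE is the sum of the speaker and noise errors.
sum_mae = {name: m["mae_speaker"] + m["mae_noise"] for name, m in metrics.items()}

best_by_mae = min(sum_mae, key=sum_mae.get)  # lower error is better
best_by_sdr = max(metrics, key=lambda name: metrics[name]["sdr"])  # higher SDR is better
print(best_by_mae, best_by_sdr)  # resnext-unet resnet-unet
```

This matches the default used by `deep_model`, `'resnet-unet'`, being the one with the highest SDR, ISR and SAR in the table.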


### Load deep model

```python
def deep_model(model: str = 'resnet-unet', quantized: bool = False, **kwargs):
    """
    Load Noise Reduction deep learning model.

    Parameters
    ----------
    model : str, optional (default='resnet-unet')
        Model architecture supported. Allowed values:

        * ``'unet'`` - pretrained UNET.
        * ``'resnet-unet'`` - pretrained resnet-UNET.
        * ``'resnext-unet'`` - pretrained resnext-UNET.
    quantized : bool, optional (default=False)
        if True, will load 8-bit quantized model. 
        Quantized model is not necessarily faster; it depends entirely on the machine.

    Returns
    -------
    result : malaya_speech.model.tf.UNET_STFT class
    """
```

```python
model = malaya_speech.noise_reduction.deep_model(model = 'resnet-unet')
```




### Load Quantized deep model

To load the 8-bit quantized model, pass `quantized = True`; the default is `False`.

Expect a slight accuracy drop from the quantized model. It is also not necessarily faster than the normal 32-bit float model; that depends entirely on the machine.
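
Storing weights as 8-bit integers instead of 32-bit floats should shrink them by about 4x, and the sizes listed by `available_model()` are consistent with that: each quantized model is a bit under a quarter of its float counterpart. A quick check using the table values (the small gap from exactly 4x is presumably metadata or tensors kept in higher precision, which is an assumption):

```python
# (float32 size, quantized size) in MB, copied from the available_model() table.
sizes_mb = {
    "unet": (78.9, 20.0),
    "resnet-unet": (96.4, 24.6),
    "resnext-unet": (75.4, 19.0),
}

# Compression ratio per model; 32-bit -> 8-bit weights would give 4x at most.
ratios = {name: float32_mb / quantized_mb for name, (float32_mb, quantized_mb) in sizes_mb.items()}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}x smaller")
```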

In [7]:
quantized_model = malaya_speech.noise_reduction.deep_model(model = 'resnet-unet', quantized = True)

Load quantized model will cause accuracy drop.


### Important factor

1. The Noise Reduction model was trained on 44.1 kHz audio, so make sure to load the audio at a 44.1 kHz sample rate.

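```python
import librosa
import malaya_speech
audio_file = 'your-audio.wav'  # placeholder path
# Use either loader; both return (samples, sample_rate), resampled to 44.1 kHz.
y, sr = malaya_speech.load(audio_file, sr = 44100)
y, sr = librosa.load(audio_file, sr = 44100)
```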

2. You can feed audio of any length; there is no need to cap it, because the model pads the input itself. However, the longer the audio, the longer the computation takes, unless you have a GPU to speed it up.
3. STFT and inverse STFT can be computed on the GPU, so the model is very fast on a GPU.
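
If your audio is already in memory at another sample rate (for example 16 kHz), it has to be brought to 44.1 kHz before it reaches the model. A minimal NumPy sketch follows; `to_44k` is a hypothetical helper, and linear interpolation is a crude resampler with no anti-aliasing filter, so prefer the resampling done by `malaya_speech.load` or `librosa.load` whenever you can reload from disk:

```python
import numpy as np

TARGET_SR = 44100  # the noise reduction models expect 44.1 kHz input


def to_44k(y: np.ndarray, sr: int) -> np.ndarray:
    """Return `y` at 44.1 kHz, resampling with linear interpolation if needed."""
    if sr == TARGET_SR:
        return y  # already at the right rate, nothing to do
    n_out = int(round(len(y) / sr * TARGET_SR))  # same duration at the new rate
    t_in = np.arange(len(y)) / sr  # timestamps of the input samples
    t_out = np.arange(n_out) / TARGET_SR  # timestamps of the output samples
    # np.interp holds the last value for timestamps past the end of the input.
    return np.interp(t_out, t_in, y).astype(y.dtype)


y16 = np.random.default_rng(0).normal(size=16000).astype(np.float32)  # 1 s at 16 kHz
y44 = to_44k(y16, 16000)
```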

```python
%%time

output = model(y)
```
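
`model(y)` received the full 633-second array without any manual trimming because the model pads internally. The page does not say how; a common pattern for U-Net style networks, which downsample by fixed factors, is to zero-pad the input to a multiple of some block size, run the network, and trim the output back to the original length. A sketch of that generic pattern, where `pad_run_trim`, `block=1024` and `toy_model` are hypothetical stand-ins rather than malaya-speech internals:

```python
import numpy as np


def pad_run_trim(y: np.ndarray, fn, block: int) -> np.ndarray:
    """Zero-pad `y` to a multiple of `block`, apply `fn`, then trim back to len(y)."""
    remainder = len(y) % block
    pad = 0 if remainder == 0 else block - remainder
    padded = np.pad(y, (0, pad))  # zeros appended at the end only
    out = fn(padded)
    return out[: len(y)]


# Toy stand-in for a network that only accepts lengths divisible by 1024.
def toy_model(x: np.ndarray) -> np.ndarray:
    assert len(x) % 1024 == 0
    return 0.5 * x


signal = np.random.default_rng(0).normal(size=10_000)  # not a multiple of 1024
voice = pad_run_trim(signal, toy_model, block=1024)
```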

```python
output
```

```python
ipd.Audio(output['voice'][:10 * sr], rate = sr)
```

```python
ipd.Audio(output['noise'][:10 * sr], rate = sr)
```

Nicely done! How about our quantized model?

```python
%%time

output_quantized = quantized_model(y)
output_quantized
```

```python
ipd.Audio(output_quantized['voice'][:10 * sr], rate = sr)
```

```python
ipd.Audio(output_quantized['noise'][:10 * sr], rate = sr)
```
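
To put a number on the quantization accuracy drop for your own audio, you can compare the float and quantized results source by source. This is only a sketch: `compare_outputs` is a hypothetical helper, and it assumes that `output` and `output_quantized` are dicts holding equal-length NumPy arrays under `'voice'` and `'noise'`, as the cells above suggest.

```python
import numpy as np


def compare_outputs(a: dict, b: dict) -> dict:
    """Mean absolute difference per source between two noise reduction outputs."""
    return {key: float(np.mean(np.abs(a[key] - b[key]))) for key in ("voice", "noise")}


# With the notebook objects above this would be: compare_outputs(output, output_quantized)
a = {"voice": np.array([1.0, 2.0]), "noise": np.zeros(2)}
b = {"voice": np.array([1.5, 2.5]), "noise": np.zeros(2)}
print(compare_outputs(a, b))  # {'voice': 0.5, 'noise': 0.0}
```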

### Use Pipeline

In case your audio is too long and you do not want to burden your machine, you can use the malaya-speech Pipeline to split the audio into 15-second frames, predict each frame one by one, and concatenate the results afterwards.
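
Conceptually, this split, predict and combine approach amounts to a plain NumPy loop like the one below. It is a sketch for intuition, not malaya-speech code: `process_in_chunks` is a hypothetical helper, and keeping the trailing partial chunk is our choice, so check `malaya_speech.generator.frames` for how it treats the remainder.

```python
import numpy as np


def process_in_chunks(y: np.ndarray, sr: int, fn, chunk_seconds: float = 15.0) -> np.ndarray:
    """Split `y` into fixed-length chunks, apply `fn` to each, and join the results.

    The last chunk is kept even if it is shorter than `chunk_seconds`, so the output
    length matches the input length whenever `fn` preserves length.
    """
    chunk = int(chunk_seconds * sr)
    pieces = [fn(y[start : start + chunk]) for start in range(0, len(y), chunk)]
    return np.concatenate(pieces)


sr = 44100
audio = np.random.default_rng(0).normal(size=40 * sr)  # 40 s of stand-in audio
# Stand-in for `lambda x: model(x)['voice']`; here it just halves the signal.
voice = process_in_chunks(audio, sr, lambda x: 0.5 * x)
```

Each chunk is processed independently, so peak memory depends on the chunk length rather than on the length of the whole recording.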

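```python
p = Pipeline()
pipeline = (
    # split the audio into 15-second frames
    p.map(malaya_speech.generator.frames, frame_duration_ms = 15000, sample_rate = sr)
    .foreach_map(model)  # run noise reduction on each frame
    .foreach_map(lambda x: x['voice'])  # keep only the voice output
    .map(np.concatenate)  # join the frames back into one array
)
p.visualize()
```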

```python
%%time

results = p.emit(y)
```

```python
results.keys()
```

```python
ipd.Audio(results['concatenate'][:10 * sr], rate = sr)
```

### Reference

1. Wave-U-Net: A Multi-Scale Neural Network for End-to-End Audio Source Separation, Daniel Stoller, Sebastian Ewert, Simon Dixon, https://arxiv.org/abs/1806.03185
2. SKOLAR MALAYSIA PODCAST, https://www.youtube.com/watch?v=blaIfSWf38Q&t=25s&ab_channel=SkolarMalaysia