## data preprocessing 
* 先前以確認且處理過Rick的資料集（英文help與alarm）
* 環境音從yt找了兩組居家的聲音 隨機採剪而成
* 這隻程式碼將進行資料的擴增（在val不改變的前提之下）

* This code is mainly based on Rick's fold 1 clean and check his data is okay.
* I found some problems in the v1. For example :the raw data is not the real raw, and some audio has a different preprocessing.

* We have a new target, so we need to make a new corrected data, and also need to make the automatic flowork data building and analysis. (if have time the EDA is also important)

**<font color=#808080> == original notes == </font>**
- 國83 台45 日104 先只試國語跟之前的 再加上日語的
- 第一份npz 將嘗試用於 single fold 1s
- save name : help3_single_fold_sp_in1s.npz (切分訓練測試驗證)

**<font color=#808080> == New Target == </font>**
- try to build an automatic flowork pipeline for easy to compute in the feature engineering.
(this pipeline not only the data preprocessing but also the feature engineering, model training, model pruning, conversion...)

- new detect target is main of two groups: 
    1.alarm (fire alarm, gas, smoke) and 
    2.help(eng, jap, cha, hak)
- delete moaning
- hak has less data but don't matter, we need to focus on the train set all can detect correctly.

### In the beginning, I picked up some data from the online, Youtube, opendata... and make sure that will be able to train in this project.
- the "help" data is clipped by hand, that are not all about 1s.
- help data we need to clip "start" until 1s.
- the "alarm" data and other(to augment) I think we can randomly clip 1s. but we also need listening 
- to make sure that the 1s is clear and can be recognized.

**the class of others, I think we need to creat an environment sound.** 


In [67]:
DATA_PATH = '/home/sail/sound_project/DATA/using_data_v3/clip_raw'

seed = 1123
sr = 16000

In [68]:
from IPython.display import Audio
from pydub import AudioSegment
import librosa
import librosa.display
from scipy.io import wavfile

from collections  import Counter
import numpy as np
import random
import os

random.seed(seed)

In [69]:
class_dict = {
                'other':0, 'Environment':0, 'alarm': 7,
                'en_help': 1, 'ch_help': 2, 'ja_help': 3, 'tw_help': 4, 'hk_help': 5, 'yue_help':6,
              } 

clip_type = {'alarm': 'random', 'other': 'random',
             'en_help': 'start', 'ch_help': 'start', 
             'ja_help': 'start', 'tw_help': 'start', 
             'hk_help': 'start', 'yue_help':'start',
             }

In [70]:
save_path = '/home/sail/sound_project/DATA/using_data_v3/v3_traindata'

In [71]:
import os

folder_path = f'{save_path}/for_training/train'
if not os.path.exists(folder_path):
    os.makedirs(folder_path)

folder_path = f'{save_path}/for_training/test'
if not os.path.exists(folder_path):
    os.makedirs(folder_path)

folder_path = f'{save_path}/for_training/val'
if not os.path.exists(folder_path):
    os.makedirs(folder_path)

folder_path = f'{save_path}/no_padding_only_clip1s'
if not os.path.exists(folder_path):
    os.makedirs(folder_path)

In [72]:
def load_data(wav_path, sr=sr, type='librosa'):
    if type == 'librosa':
        return librosa.load(wav_path, sr=sr)[0]
    elif type == 'wavfile':
        return wavfile.read(wav_path)[1]

def clip_1s(audio, sr=sr, type='start'):
    if type =='start':
        return audio[:sr]
    elif type == 'end':
        return audio[-sr:]
    elif type == 'random':
        start = random.randint(0, len(audio) - sr)
        return audio[start:start+sr]
    else:
        # return audio[type:type+sr]
        raise ValueError('type must be start, end or random.')
    
def long_random_clip(audio, sr, count):
    audio_list = []
    random_time_list = [random.randint(0, len(audio)) for _ in range(count)]
    random_time_list = list(dict.fromkeys(random_time_list))
    for start in random_time_list:
        audio_list.append(audio[start:start+sr])
    return audio_list

def padding_zero(audio, sr=sr, secent=1, type='a'):
    if len(audio) < sr*secent:
        if type=='ab':
            total_padding = sr*secent - len(audio)
            return np.pad(audio, (total_padding // 2, total_padding - (total_padding // 2)), 'constant', constant_values=(0, 0))
        elif type=='a':
            total_padding = sr*secent - len(audio)
            return np.pad(audio, (0, total_padding), 'constant', constant_values=(0, 0))
    else:
        return clip_1s(audio)
    
def add_noise(audio, noise_factor=0.0005):
    noise = np.random.randn(len(audio))
    augmented_audio = audio + noise_factor * noise
    return augmented_audio

# def slow_down_audio(audio, secent=1, rate=0.8):
#     if len(audio) < sr*secent:
#         return padding_zero(audio)
#     else:
#         return clip_1s(librosa.effects.time_stretch(audio, rate=rate))
    

In [73]:
def preprocess_audio(audio):
    if (audio.shape[0] >= sr) & (audio.shape[0] <= sr*2):
        audio_1 = clip_1s(audio, sr=sr, type='start')
        audio_2 = clip_1s(audio, sr=sr, type='end')
        audio_list = [audio_1, audio_2]
    elif (audio.shape[0] > sr*2):
        audio_list = long_random_clip(audio, sr,  int(audio.shape[0]/1.5//sr)+1)
    else:
        audio_list = [audio]
    return audio_list


In [74]:
X, y = [],[]

def process_subfolder(subfolder_path, label):
    for wav_file in os.listdir(subfolder_path):
        if wav_file == 'xx':
            continue
        wav_file_path = os.path.join(subfolder_path, wav_file)
        if wav_file.endswith(('.wav', '.mp3')):
            audio = load_data(wav_file_path, sr, type='librosa')
            X.extend(preprocess_audio(audio))
            y.extend([label]*len(preprocess_audio(audio)))              
        else:
            process_subfolder(wav_file_path, label)
            
for file in ['other', 'alarm', 'help_data']:  # os.listdir(DATA_PATH)
    folder_path = os.path.join(DATA_PATH, file)
    
    for wav_file_0 in os.listdir(folder_path):
        if wav_file_0 == 'xx':
            continue
        wav_file_0_path = os.path.join(folder_path, wav_file_0)
        
        if file == 'other':
            audio = load_data(wav_file_0_path, sr, type='librosa')
            evn_audio = long_random_clip(audio, sr, 100)
            X.extend(evn_audio) 
            y.extend([class_dict['Environment']] * len(evn_audio))
        elif wav_file_0.endswith(('.wav', '.mp3')):
            audio = load_data(wav_file_0_path, sr, type='librosa')
            X.extend(preprocess_audio(audio))
            y.extend([class_dict[file]]*len(preprocess_audio(audio)))
        else:
            process_subfolder(wav_file_0_path, class_dict[wav_file_0])
print(len(X), len(y))            


2721 2721


In [75]:
Counter(y)


Counter({0: 1400, 2: 493, 4: 278, 7: 201, 1: 169, 3: 86, 5: 64, 6: 30})

In [76]:
# # save the data for processing in the feature

for c,i in enumerate(y):
    wavfile.write(f'{save_path}/no_padding_only_clip1s/{[key for key, value in class_dict.items() if value == i][0]}_{Counter(y[:c])[i]}.wav', sr, X[c])


In [77]:
for i, audio_path in enumerate(os.listdir(os.path.join(save_path, 'no_padding_only_clip1s'))):
    audio = padding_zero(load_data(os.path.join(save_path, 'no_padding_only_clip1s', audio_path), sr, type='librosa'), secent=1, type='a')
    for val_name in ['_1.wav', '_15.wav', '_6.wav', '_36.wav', '_80.wav', '_74.wav', '_60.wav', '_55.wav', '_44.wav', '_73.wav']:
        if val_name in audio_path:
            wavfile.write(f'{save_path}/for_training/val/{audio_path[:-4]}_padA.wav', sr, audio)
    if ('Environment_' in audio_path):
        TF = False
        for _ in range(200):
            num = random.randint(0, Counter(y)[0])
            val_name_ = f'_{num}.wav'
            if val_name_ in audio_path:
                TF = True                
        if TF:
            wavfile.write(f'{save_path}/for_training/val/{audio_path[:-4]}_padA.wav', sr, audio)
        else:
            wavfile.write(f'{save_path}/for_training/train/{audio_path[:-4]}_padA.wav', sr, audio)

    if ('Environment_' not in audio_path):
        audio = padding_zero(load_data(os.path.join(save_path, 'no_padding_only_clip1s', audio_path), sr, type='librosa'), secent=1, type='a')
        for val_name in ['_1.wav', '_15.wav', '_6.wav', '_36.wav', '_80.wav', '_74.wav', '_60.wav', '_55.wav', '_44.wav', '_73.wav']:
            if val_name in audio_path:
                if ('tw_' in audio_path) or ('hk_' in audio_path):
                    wavfile.write(f'{save_path}/for_training/train/{audio_path[:-4]}_padA.wav', sr, audio)
                    wavfile.write(f'{save_path}/for_training/val/{audio_path[:-4]}_padA.wav', sr, audio)
                    break
                wavfile.write(f'{save_path}/for_training/val/{audio_path[:-4]}_padA.wav', sr, audio)
            else:
                wavfile.write(f'{save_path}/for_training/train/{audio_path[:-4]}_padA.wav', sr, audio)

        audio = add_noise(load_data(os.path.join(save_path, 'no_padding_only_clip1s', audio_path), sr, type='librosa'))
        audio = padding_zero(audio, secent=1, type='a')
        if ('ja_' in audio_path) and i % 3 == 0:
            wavfile.write(f'{save_path}/for_training/train/{audio_path[:-4]}_an_padA.wav', sr, audio)
        elif ('ja_' not in audio_path):
            wavfile.write(f'{save_path}/for_training/train/{audio_path[:-4]}_an_padA.wav', sr, audio)

    # elif ('alarm_' not in audio_path) and ('Environment_' not in audio_path):
    #     audio = slow_down_audio(load_data(os.path.join(save_path, 'no_padding_only_clip1s', audio_path), sr, type='librosa'))
    #     wavfile.write(f'{save_path}/for_training/train/{audio_path[:-4]}_sl_padA.wav', sr, audio)
    
        



In [78]:
for c,audio_path in enumerate(os.listdir(os.path.join(save_path, 'no_padding_only_clip1s'))):
    if ('alarm_' not in audio_path) and ('Environment_' not in audio_path):
    
        audio = padding_zero(load_data(os.path.join(save_path, 'no_padding_only_clip1s', audio_path), sr, type='librosa'), secent=1, type='a')
        for val_name in ['_1.wav', '_15.wav', '_6.wav', '_36.wav', '_80.wav', '_74.wav', '_60.wav', '_55.wav', '_44.wav', '_73.wav']:
            if val_name in audio_path:
                if ('tw_' in audio_path) or ('hk_' in audio_path):
                    wavfile.write(f'{save_path}/for_training/train/{audio_path[:-4]}_padAB.wav', sr, audio)
                    wavfile.write(f'{save_path}/for_training/val/{audio_path[:-4]}_padAB.wav', sr, audio)
                    break
                wavfile.write(f'{save_path}/for_training/val/{audio_path[:-4]}_padAB.wav', sr, audio)

        audio = add_noise(load_data(os.path.join(save_path, 'no_padding_only_clip1s', audio_path), sr, type='librosa'))
        audio = padding_zero(audio, secent=1, type='a')
        for val_name in ['_1.wav', '_15.wav', '_6.wav', '_36.wav', '_80.wav', '_74.wav', '_60.wav', '_55.wav', '_44.wav', '_73.wav']:
            if val_name in audio_path:
                if ('tw_' in audio_path) or ('hk_' in audio_path):
                    wavfile.write(f'{save_path}/for_training/train/{audio_path[:-4]}_an_padAB.wav', sr, audio)
                    wavfile.write(f'{save_path}/for_training/val/{audio_path[:-4]}_an_padAB.wav', sr, audio)
                    break
                # wavfile.write(f'{save_path}/for_training/val/{audio_path[:-4]}_an_padAB.wav', sr, audio)

    if ('alarm_' not in audio_path) and ('Environment_' not in audio_path) and c%2==0:
        continue
        # audio = slow_down_audio(load_data(os.path.join(save_path, 'no_padding_only_clip1s', audio_path), sr, type='librosa'))
        # wavfile.write(f'{save_path}/for_training/train/{audio_path[:-4]}_sl_padAB.wav', sr, audio)

    else:
        if ('Environment_' not in audio_path):
            audio = padding_zero(load_data(os.path.join(save_path, 'no_padding_only_clip1s', audio_path), sr, type='librosa'), secent=1, type='a')
            for val_name in ['_1.wav', '_15.wav', '_6.wav', '_36.wav', '_80.wav', '_74.wav', '_60.wav', '_55.wav', '_44.wav', '_73.wav']:
                if val_name in audio_path:
                    if ('tw_' in audio_path) or ('hk_' in audio_path):
                        wavfile.write(f'{save_path}/for_training/train/{audio_path[:-4]}_padAB.wav', sr, audio)
                        wavfile.write(f'{save_path}/for_training/val/{audio_path[:-4]}_padAB.wav', sr, audio)
                        break
                    wavfile.write(f'{save_path}/for_training/val/{audio_path[:-4]}_padAB.wav', sr, audio)
                else:
                    wavfile.write(f'{save_path}/for_training/train/{audio_path[:-4]}_padAB.wav', sr, audio)

            audio = add_noise(load_data(os.path.join(save_path, 'no_padding_only_clip1s', audio_path), sr, type='librosa'))
            audio = padding_zero(audio, secent=1, type='a')
            for val_name in ['_1.wav', '_15.wav', '_6.wav', '_36.wav', '_80.wav', '_74.wav', '_60.wav', '_55.wav', '_44.wav', '_73.wav']:
                if val_name in audio_path:
                    if ('tw_' in audio_path) or ('hk_' in audio_path):
                        wavfile.write(f'{save_path}/for_training/train/{audio_path[:-4]}_an_padAB.wav', sr, audio)
                        # wavfile.write(f'{save_path}/for_training/val/{audio_path[:-4]}_padAB.wav', sr, audio)
                        break
                    # wavfile.write(f'{save_path}/for_training/val/{audio_path[:-4]}_an_padAB.wav', sr, audio)
                else:
                    wavfile.write(f'{save_path}/for_training/train/{audio_path[:-4]}_an_padAB.wav', sr, audio)
            # if ('alarm_' not in audio_path)and (c%2==0):
            #     audio = slow_down_audio(load_data(os.path.join(save_path, 'no_padding_only_clip1s', audio_path), sr, type='librosa'))
            #     wavfile.write(f'{save_path}/for_training/train/{audio_path[:-4]}_sl_padAB.wav', sr, audio)

In [79]:
TEST_path = '/home/sail/sound_project/DATA/using_data_v3/clip_raw/TEST'

In [80]:
for wav_file in os.listdir(TEST_path):
    for i, aud_name in enumerate(os.listdir(os.path.join(TEST_path, wav_file))):
        audio = padding_zero(clip_1s(load_data(os.path.join(TEST_path, wav_file,aud_name))), secent=1, type='ab')
        wavfile.write(f'{save_path}/for_training/val/{wav_file}_TEST_{i}.wav', sr, audio)
        wavfile.write(f'{save_path}/for_training/test/{wav_file}_TEST_{i}.wav', sr, audio)

In [81]:
# save npz

sounds_train, sounds_test, sounds_val = [], [], []
labels_train, labels_test, labels_val = [], [], []

for file in os.listdir(os.path.join(save_path, 'for_training')):
    print(file)
    for wav_path in os.listdir(os.path.join(save_path, 'for_training',file)):
        path = os.path.join(save_path, 'for_training',file, wav_path)
        wav = load_data(path, sr, type='librosa')
        if file == 'train':
            sounds_train.append(wav)
            try:
                labels_train.append(class_dict[wav_path.split('_')[0]])
            except Exception as e:
                labels_train.append(class_dict[wav_path.split('_')[0]+'_'+wav_path.split('_')[1]])            

        elif file == 'val':
            sounds_val.append(wav)
            try:
                labels_val.append(class_dict[wav_path.split('_')[0]])
            except Exception as e:
                labels_val.append(class_dict[wav_path.split('_')[0]+'_'+wav_path.split('_')[1]])

        elif file == 'test':
            sounds_test.append(wav)
            try:
                labels_test.append(class_dict[wav_path.split('_')[0]])
            except Exception as e:
                labels_test.append(class_dict[wav_path.split('_')[0]+'_'+wav_path.split('_')[1]])            


print(len(sounds_train), len(labels_train), len(sounds_val), len(labels_val), len(sounds_test), len(labels_test))

np.savez(r'/home/sail/sound_project/DATA/using_data_v3/data_v3.npz', sounds_train=sounds_train, labels_train=labels_train, sounds_val=sounds_val, 
         labels_val=labels_val, sounds_test=sounds_test, labels_test=labels_test)


val
train
test
8307 8307 211 211 54 54


In [82]:
data = np.load('/home/sail/sound_project/DATA/using_data_v3/data_v3.npz', allow_pickle=True) 

In [83]:
Counter(labels_val)

Counter({4: 40, 5: 31, 2: 30, 3: 30, 1: 30, 7: 24, 0: 20, 6: 6})

In [84]:
Counter(labels_train)


Counter({0: 4142, 2: 1520, 4: 828, 7: 804, 1: 512, 5: 214, 3: 203, 6: 84})

In [85]:
Counter(labels_test)


Counter({2: 10, 4: 10, 1: 10, 5: 10, 3: 10, 7: 4})

In [86]:
max(np.concatenate(sounds_train)), min(np.concatenate(sounds_train))

(1.4524899, -1.4327664)

In [26]:
max(np.concatenate(sounds_train)), min(np.concatenate(sounds_train))

(1.8290101, -2.0462713)

# other test

In [None]:
Audio(np.concatenate(X),rate=sr)

In [None]:
audio0 = load_data('/home/sail/sound_project/DATA/v2.2_traindata/for_training/train/tw_help_11_padA.wav', sr, type='librosa')

In [None]:
Audio(audio0,rate=sr)


In [None]:
Audio(audio,rate=sr)

In [None]:
print(class_dict)
Counter(y)