# Virufy COVID Quickstart

Hello fellow fighters! We welcome you as allies in the battle against COVID-19. This notebook provides a quick tutorial on how to download our data, preprocess it, and quickly get started training models. 

## Part 1: Setup

First, we import some packages. If you are running this in Colab they should all come pre-installed. If you're running this locally, you might need to install these packages first. 
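
If you are running locally, a quick way to see which packages still need installing is to probe for them before running the import cell. This is a minimal sketch using only the standard library; the `REQUIRED` list and the `find_missing` helper are illustrative names, not part of the Virufy code.

```python
import importlib.util

# Import names used in this notebook. The pip package names differ for some
# of them (for example `cv2` ships in `opencv-python`, `sklearn` in `scikit-learn`).
REQUIRED = ["sklearn", "pandas", "librosa", "cv2", "numpy", "matplotlib"]


def find_missing(modules):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


missing = find_missing(REQUIRED)
print("Missing packages:", missing or "none")
```

Anything reported as missing can then be installed with `pip` before continuing.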

In [None]:
from sklearn.model_selection import train_test_split
import pandas as pd
import os
import librosa
import librosa.display
import cv2
import numpy as np
import json
import matplotlib.pyplot as plt
import warnings
warnings.simplefilter("ignore")

from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## Part 2: Data Download

Now, we download the CoughVID data from a different, open-source Virufy repo:

In [None]:
# Download coughvid data in CDF format
# Run once 
!git clone "https://github.com/virufy/virufy-cdf-coughvid.git"
%cd virufy-cdf-coughvid

## Part 3: Data Cleaning
Now we're ready to load our data into memory! 

If you listen to the recordings, you might notice that some of them aren't coughs. To help you, we've already run a model that predicts whether each sound file is really a cough; its score is stored in the `cough_detected` column. Here, we filter our dataset on that score. We generally recommend a threshold of 0.7; the cell below uses a stricter cutoff of 0.99, which keeps fewer but cleaner recordings.
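
To get a feel for how the cutoff trades data volume against label quality, here is a minimal sketch on made-up scores. The `toy` DataFrame and the `count_kept` helper are assumptions for illustration; only the `cough_detected` column name comes from the dataset.

```python
import pandas as pd

# Synthetic scores standing in for the real `cough_detected` column.
toy = pd.DataFrame({"cough_detected": [0.05, 0.40, 0.72, 0.85, 0.95, 0.995]})


def count_kept(df, threshold, column="cough_detected"):
    """Number of rows whose score is at or above the threshold."""
    return int((df[column] >= threshold).sum())


for threshold in (0.5, 0.7, 0.9, 0.99):
    kept = count_kept(toy, threshold)
    print(f"threshold={threshold:.2f} -> kept {kept} of {len(toy)}")
```

Running the same loop on the real `coughvid` DataFrame shows how many recordings each candidate threshold would leave you with.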

In [None]:
coughvid = pd.read_csv("virufy-cdf-coughvid.csv")
msk = (coughvid.loc[:,'cough_detected'] >= 0.99)
coughvid = coughvid.loc[msk,:]

Let's take a quick look at our labels! 


In [None]:
# Filtering cough_detected to > 0.7 is advisable.
# The threshold can be tuned as part of model development; we recommend testing different thresholds after a model has been completed.
coughvid.head()

In [None]:
# Disclaimer: we have inferred some of these pcr_test_result labels based on other columns
# Target = pcr_test_result_inferred
# Positive, negative, untested

coughvid['pcr_test_result_inferred'].head(30)

There are a lot of recordings labelled 'untested'. These can't be used directly in supervised learning, so for now we'll filter them out as well, keeping only the recordings that are 'positive' or 'negative'.
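
One detail worth knowing: excluding `'untested'` keeps any rows whose label is missing (NaN), whereas an explicit whitelist drops them too. The sketch below shows the whitelist variant on a toy frame; `keep_tested` and the toy data are illustrative assumptions, not part of the Virufy code.

```python
import pandas as pd

toy = pd.DataFrame({
    "pcr_test_result_inferred": ["positive", "negative", "untested", None, "negative"],
})


def keep_tested(df, column="pcr_test_result_inferred"):
    """Keep only rows explicitly labelled 'positive' or 'negative'."""
    return df.loc[df[column].isin(["positive", "negative"])]


tested = keep_tested(toy)
# Class counts after filtering; useful for spotting class imbalance early.
print(tested["pcr_test_result_inferred"].value_counts())
```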

In [None]:
# Filter out untested results
msk = (coughvid.loc[:,'pcr_test_result_inferred']=='untested')
coughvid = coughvid.loc[~msk,:]
coughvid


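Our cleaned data now consists only of recordings labelled 'positive' or 'negative', each with a selection of clinical features. With the 0.99 cough-detection threshold used above, that is 2773 recordings (the train/test shapes printed in Part 4 add up to 2218 + 555); a looser threshold keeps more.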

## Part 4: Data Preprocessing

Now that we have a clean dataset, we split it into a train set and a held-out set (80/20). We'll train on the train split and use the held-out split, named `cdf_test` in the code, as the validation set to decide when to stop training.
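
Because positives are much rarer than negatives, it helps to split per class so both splits keep the same class ratio. scikit-learn's `train_test_split(..., stratify=...)` handles this for you; the NumPy-only sketch below is not its implementation, just an illustration of the idea, and `stratified_split` is a hypothetical helper.

```python
import numpy as np


def stratified_split(labels, test_size=0.2, seed=0):
    """Return (train_idx, test_idx) with matching class ratios in both splits."""
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    train_idx, test_idx = [], []
    for cls in np.unique(labels):
        idx = np.flatnonzero(labels == cls)   # positions of this class
        rng.shuffle(idx)                      # shuffle within the class
        n_test = int(round(test_size * len(idx)))
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return np.array(train_idx), np.array(test_idx)


labels = np.array(["positive"] * 20 + ["negative"] * 80)
train_idx, test_idx = stratified_split(labels)
print("positive share (train):", (labels[train_idx] == "positive").mean())
print("positive share (test): ", (labels[test_idx] == "positive").mean())
```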

In [None]:
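# Train/test split, stratified on whether the label is "positive"
# (use == for string comparison; `is` tests object identity and is unreliable here)
stratify_labels = coughvid["pcr_test_result_inferred"].map(lambda x: x if x == "positive" else "not_positive")
cdf_train, cdf_test = train_test_split(coughvid, test_size=0.2, random_state = 0, stratify = stratify_labels, shuffle=True)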

In [None]:
cdf_train.shape, cdf_test.shape

((2218, 16), (555, 16))

In [None]:
sum(cdf_train['pcr_test_result_inferred'] == 'positive'), sum(cdf_train['pcr_test_result_inferred'] == 'negative')

(451, 1767)

In [None]:
sum(cdf_test['pcr_test_result_inferred'] == 'positive'), sum(cdf_test['pcr_test_result_inferred'] == 'negative')

Here, we define our custom preprocessing pipeline. We extract the following relevant audio features:
- Mel-Frequency Cepstral Coefficients (MFCCs) 
- Mel-Spectrograms
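
Both features come out as 2-D arrays of shape (frames, coefficients). Assuming librosa's default centered framing, the number of frames is `1 + n_samples // hop_length`, so the shapes can be worked out ahead of time. The sketch below uses the values that appear in this notebook (22050 Hz, clips fixed to 154350 samples, librosa's default mel hop of 512, and the 10 ms MFCC hop); `n_frames` is an illustrative helper.

```python
def n_frames(n_samples, hop_length):
    """Frame count for librosa-style centered framing (center=True)."""
    return 1 + n_samples // hop_length


SR = 22050                   # sampling rate used throughout the notebook
N_SAMPLES = 154350           # fixed clip length passed to librosa.util.fix_length
HOP_MEL = 512                # librosa's default hop_length for melspectrogram
HOP_MFCC = int(0.010 * SR)   # 10 ms hop used in get_rawMFCCs

print("clip duration (s):", N_SAMPLES / SR)
print("mel-spectrogram shape:", (n_frames(N_SAMPLES, HOP_MEL), 128))
print("MFCC shape:", (n_frames(N_SAMPLES, HOP_MFCC), 13))
```

The mel-spectrogram shape agrees with the `(302, 128)` printed at the end of this notebook.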

We also cache these features so that the preprocessing only needs to be run once. 
Feel free to modify this section in any way you like!
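
The caching idea can be sketched as a small "load if present, otherwise compute and save" wrapper around `pickle`. This is an illustration only: `load_or_compute` and `fake_extract` are hypothetical names, and the temporary directory stands in for wherever you keep your cache (for example a Drive folder).

```python
import os
import pickle
import tempfile

import numpy as np


def load_or_compute(cache_path, compute_fn):
    """Load pickled features if the cache file exists; otherwise compute and save them."""
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as f:
            return pickle.load(f)
    result = compute_fn()
    with open(cache_path, "wb") as f:
        pickle.dump(result, f)
    return result


# Stand-in for the expensive feature extraction step.
calls = []

def fake_extract():
    calls.append(1)
    return np.zeros((2, 702, 13)), np.array(["positive", "negative"])


cache_file = os.path.join(tempfile.mkdtemp(), "features.data")
features, labels = load_or_compute(cache_file, fake_extract)  # computes and saves
features, labels = load_or_compute(cache_file, fake_extract)  # loads from disk
print(features.shape, "extraction runs:", len(calls))
```

Only unpickle files you created yourself, since loading a pickle can execute arbitrary code.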

In [None]:
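# Functions to process audio files into images and json features
def trim_silence(x, *args):
    # Expected args: pad (in samples), db_max (dB below peak), frame_length, hop_length
    try:
        pad,db_max,frame_length,hop_length = args[0],args[1],args[2],args[3]
    except IndexError:
        print('Please enter the following arguments: pad,db_max,frame_length,hop_length')
        return

    # Note: frame_length and hop_length are accepted but the trim call below uses fixed values (256, 64)
    _, ints = librosa.effects.trim(x, top_db=db_max, frame_length=256, hop_length=64)
    # Keep `pad` samples of context on each side of the non-silent interval
    start = int(max(ints[0]-pad, 0))
    end   = int(min(ints[1]+pad, len(x)))
    return x[start:end]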

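def process_cough_file(path,trim,*args):
    # Optional args: sr, removeaudio, chunk (seconds), db_max; defaults are used if any are missing
    try:
        sr,removeaudio,chunk,db_max = args[0],args[1],args[2],args[3]
    except IndexError:
        sr,removeaudio,chunk,db_max= 22050,False,3,50
    try:
        x,sr = librosa.load(path, sr=sr)
    except Exception:
        return -1  # file could not be loaded

    # Skip clips shorter than 0.3 s or longer than 30 s
    if len(x)/sr < 0.3 or len(x)/sr > 30:
        return None,None
    hop_length = np.floor(0.010*sr).astype(int) #10ms
    win_length = np.floor(0.020*sr).astype(int) #20ms

    # Warning: removeaudio=True deletes the source file from disk
    if removeaudio:
        os.remove(path)

    # Trim leading/trailing silence (keeping 0.25 s of padding), then cut to `chunk` seconds
    x = trim(x, 0.25*sr, db_max,win_length,hop_length)
    x = x[:np.floor(chunk*sr).astype(int)]

    # Zero-pad to the chunk size if the clip is shorter
    x_pad = np.zeros(int(sr*chunk))
    x_pad[:min(len(x_pad), len(x))] = x[:min(len(x_pad), len(x))]

    return [x_pad,sr,hop_length,win_length]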

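def get_melspec(sdir,audio,sr,name):
    # Mel spectrogram in dB, returned as a nested list of shape (frames, n_mels).
    # sdir and name are unused here; they are kept so existing call sites still work.
    audio = librosa.util.fix_length(audio, size=154350)  # 7 s at 22050 Hz
    melspec  = librosa.feature.melspectrogram(y=audio,sr=sr,n_mels=128, fmax=8000)
    s_db     = librosa.power_to_db(melspec, ref=np.max)
    rawSBD = s_db.T.tolist()
    return rawSBD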

def get_rawMFCCs(audio,sr,*args):
    # Optional args: hop_length, win_length, n_mfcc, n_mels, n_fft; defaults are used if any are missing
    try:
        hop_length,win_length,n_mfcc,n_mels,n_fft = args[0],args[1],args[2],args[3],args[4]
    except IndexError:
        hop_length = np.floor(0.010*sr).astype(int) #10ms
        win_length = np.floor(0.020*sr).astype(int) #20ms (currently unused)
        n_mfcc,n_mels,n_fft=13,13,2048

    # Pad or truncate to 154350 samples (7 s at 22050 Hz) so every clip yields the same number of frames
    audio = librosa.util.fix_length(audio, size=154350)
    rawMFCCs = librosa.feature.mfcc(y=audio,sr=sr, n_mfcc=n_mfcc,n_mels=n_mels, n_fft=n_fft, hop_length=hop_length)
    # rawMFCCs    = np.mean(rawMFCCs.T,axis=0).tolist()
    rawMFCCs = rawMFCCs.T.tolist()  # shape (frames, n_mfcc)

    return rawMFCCs

def getlabel(key, dataframe, chosen):
    # Look up the label of the row whose id column equals `key`
    return dataframe.loc[dataframe[chosen['id']]==key][chosen['pcr']].tolist()[0]

def extract(df, chosen, savedir):
    os.makedirs(savedir, exist_ok=True)

    keys, dirs = df[chosen['id']].tolist(),df[chosen['path']].tolist()
    audio_objs = [process_cough_file(path,trim_silence) for path in dirs]
    # process_cough_file returns -1 or (None, None) for unusable files; keep only the successful results
    valid = [i for i, obj in enumerate(audio_objs) if not isinstance(obj, (int, tuple))]

    # Unpack per-file results directly: np.array() on these ragged lists fails on recent NumPy versions
    audio = [audio_objs[i][0] for i in valid]
    sr = [audio_objs[i][1] for i in valid]
    keys = [keys[i] for i in valid]

    data = {
              key:{
                    'DIR':get_melspec(savedir,a_i,sr_i,key),
                    'rawMFCC':get_rawMFCCs(a_i,sr_i),
                    'label':getlabel(key, df, chosen)
                  } for key,a_i,sr_i in zip(keys,audio,sr)
            }
    return data

    
def filter_DF(df):
    names = list(df.columns)
    chosen= {}
    for name in names:
        if 'inferred' in name.lower():chosen['pcr'] = name # Choosing the target (pcr_test_result_inferred)
        elif 'path' in name.lower():chosen['path'] = name
        elif 'patient' in name.lower() or 'id' == name.lower() :chosen['id'] = name
    return df[[chosen['id'],chosen['pcr'],chosen['path']]].dropna().reset_index(), chosen 

def extract_features(train_df, test_df, dir_train, dir_test):
    train_dataframe, train_chosen = filter_DF(train_df)
    test_dataframe, test_chosen = filter_DF(test_df)
    
    train_features = extract(train_dataframe, train_chosen, dir_train)
    test_features = extract(test_dataframe, test_chosen, dir_test)
    
    return train_features, test_features


def show_image(_image):
    fig, ax = plt.subplots()
    img = librosa.display.specshow(_image, ax=ax)
    plt.style.use('classic')
    plt.xlabel('time')
    plt.ylabel('frequency')
    fig.colorbar(img, ax=ax)
    ax.set(title='IMG')
    plt.show()
    pass


import pickle
def save_dump(file_path, data, labels):
    # Pickle (data, labels) to file_path; the file is closed automatically
    with open(file_path, 'wb') as file:
        pickle.dump((data, labels), file)


def load_data(path_file):
    # Load the (data, labels) tuple written by save_dump
    with open(path_file, 'rb') as file:
        (pixels, labels) = pickle.load(file)

    print(pixels.shape)
    print(labels.shape)
    return pixels, labels


def view_chart(performance, people, chart):
    fig, ax = plt.subplots()
    y_pos = np.arange(len(people))
    ax.barh(y_pos, performance, align='center', color=['dodgerblue','orange'])
    for index, value in enumerate(performance):
        plt.text(value, index, str(value))
    ax.set_yticks(y_pos)
    ax.set_yticklabels(people)
    ax.invert_yaxis()
    ax.set_xlabel('Number')
    ax.set_title(chart)
    plt.xlim(0, max(performance) + 400)
    plt.show()

In [None]:
!pip install audiomentations
import IPython.display as ipd

from tqdm import tqdm
from scipy.io.wavfile import write

import matplotlib.pyplot as plt
import numpy as np

from audiomentations import Compose, TimeStretch, PitchShift, Shift, Trim, Gain, PolarityInversion, AddGaussianNoise, BandPassFilter, BandStopFilter
from audiomentations import GainTransition
from audiomentations import SpecCompose, SpecChannelShuffle, SpecFrequencyMask, FrequencyMask
import pandas as pd


def save_csv_data(data_dict, dir):
    uuid = np.array([feat for feat in data_dict])
    image = np.array([data_dict[feat]['DIR'] for feat in data_dict])
    label = np.array([data_dict[feat]['label'] for feat in data_dict])
    metadata_image_mfc = {
        'uuid': uuid,
        'images': image,
        'assessment_result': label,
    }
    df = pd.DataFrame(metadata_image_mfc, columns=['uuid', 'images', 'assessment_result'])
    df.to_csv(dir, index=False, header=True)
    pass

# save_csv_data(train_features, '/content/drive/MyDrive/Data-Covid/virufy-data-train.csv')
# save_csv_data(test_features, '/content/drive/MyDrive/Data-Covid/virufy-data-test.csv')

In [None]:
augment_1 = Compose([
    TimeStretch(min_rate=0.7, max_rate=1.4, p=0.9),
    PitchShift(min_semitones=-2, max_semitones=4, p=1),
    Shift(min_fraction=-0.5, max_fraction=0.5, p=0.8),
    Trim(p=1),
    GainTransition(p=1),
    PolarityInversion(p=0.5),
])

# augment_2 = Compose([
#     TimeStretch(min_rate=0.8, max_rate=1.25, p=0.5),
#     PitchShift(min_semitones=-4, max_semitones=4, p=0.5),
#     Shift(min_fraction=-0.5, max_fraction=0.5, p=0.5),
#     GainTransition(p=1),
#     Trim(p=0.5),
#     PolarityInversion(p=0.7),
# ])

# augment_3 = Compose([
#     TimeStretch(min_rate=0.7, max_rate=1.4, p=0.9),
#     PitchShift(min_semitones=-2, max_semitones=4, p=1),
#     Shift(min_fraction=-0.5, max_fraction=0.5, p=0.8),
#     Trim(p=0.5),
#     FrequencyMask(p=0.5),
# ])

# augment_3 = SpecCompose(
#     [
#         SpecChannelShuffle(p=0.5),
#         SpecFrequencyMask(p=0.5),
#         Trim(p=1),
#     ]
# )

augment_global = Compose([
    Trim(p=1),             
])


def data_augmentation(df, chosen, savedir, status):
    if not os.path.isdir(savedir):
        os.mkdir(savedir)
        
    keys, dirs, labels = df[chosen['id']].tolist(), df[chosen['path']].tolist(), df[chosen['pcr']].tolist()

    print("P :", sum(df[chosen['pcr']] == 'positive'))
    print("N :", sum(df[chosen['pcr']] == 'negative'))


    # print(dirs)

    # audio_objs = [process_cough_file(path,trim_silence) for path in dirs]
    # false_indices = [i for i in range(len(audio_objs)) if isinstance(audio_objs[i],int) or isinstance(audio_objs[i],tuple)]

    # audio_objs = [audio_objs[i] for i in range(len(audio_objs)) if i not in false_indices]
    # audio_objs = np.array(audio_objs)

    # # print(audio_objs)

    # audio,sr,hop_length,win_length = audio_objs[:,0],audio_objs[:,1],audio_objs[:,2],audio_objs[:,3]
    
    # dirs = [dirs[i] for i in range(len(dirs)) if i not in false_indices]
    # keys = [keys[i] for i in range(len(keys)) if i not in false_indices]

    # data_path = []
    data_librosa = []
    data_uuid = []
    data_labels = []
    for key, path, label in list(zip(keys, dirs, labels)):
         
        # print(key)
        a_i, sr_i = librosa.load(path, sr=22050)  # sr is keyword-only in recent librosa versions
        # write(savedir + key + '.wav', sr_i, a_i)

        sr_i = 22050
        data_global = augment_global(a_i, sr_i)
        # img = get_melspec(savedir, data_global, sr_i, key)
        img = get_rawMFCCs(data_global, sr_i)
        # print(np.array(img).shape)
        # data_path.append(key + '.png')
        data_librosa.append(img)
        data_uuid.append(key)
        data_labels.append(label)

        if label == 'positive' and status == True:
            data_aug_1 = augment_1(a_i, sr_i)
            # img_aug_1 = get_melspec(savedir, data_aug_1, sr_i, key + '_aug_1')
            img_aug_1 = get_rawMFCCs(data_aug_1, sr_i)
            # data_path.append(key + '_aug_1.png')
            # print(np.array(img_aug_1).shape)
            data_librosa.append(img_aug_1)
            data_uuid.append(key + '_aug_1')
            data_labels.append(label)
            # print(key + '_aug_1')

            # data_aug_2 = augment_2(a_i, sr_i)
            # img_aug_2 = get_melspec(savedir, data_aug_2, sr_i, key + '_aug_2')
            # data_path.append(key + '_aug_2.png')
            # data_uuid.append(key + '_aug_2')
            # data_labels.append(label)
            # print(img_aug_2)

            # data_aug_3 = augment_3(a_i, sr_i)
            # img_aug_3 = get_melspec(savedir, data_aug_3, sr_i, key + '_aug_3')
            # data_path.append(key + '_aug_3.png')
            # data_uuid.append(key + '_aug_3')
            # data_labels.append(label)
            # print(img_aug_3)

            # write(savedir + key + '_not_noise.wav', sr_i, data_not_noise)
            # write(savedir + key + '_add_noise.wav', sr_i, data_noise)
    return np.array(data_uuid), np.array(data_librosa), np.array(data_labels)


def save_csv_data_audio(data_uuid, data_path, data_labels, savedir, name):
    # df, chosen = filter_DF(train_df)
    # keys, labels = df[chosen['id']].tolist(), df[chosen['pcr']].tolist()
    # data_path = []
    # data_uuid = []
    # data_label = []
    # for key, label in list(zip(keys, labels)):
    #     data_uuid.append(key)
    #     data_path.append(key + '.wav')
    #     data_label.append(label)
    #     if label == 'positive' and status == True:
    #         # print(label)
    #         data_uuid.append(key + '_not_noise')
    #         data_uuid.append(key + '_add_noise')
    #         data_label.append(label)
    #         data_label.append(label)
    #         data_path.append(key + '_not_noise.wav')
    #         data_path.append(key + '_add_noise.wav')
    
    metadata_audio = {
        'uuid': data_uuid,
        'path': data_path,
        'labels': data_labels
    }
    df = pd.DataFrame(metadata_audio, columns=['uuid', 'path', 'labels'])
    print(df)
    df.to_csv(savedir + name, index=False, header=True)


# def process_data_aug(train_df, test_df, dir_train, dir_test, save_train, save_test, csv_train, csv_test):
#     train_dataframe, train_chosen = filter_DF(train_df)
#     test_dataframe, test_chosen = filter_DF(test_df)

#     print(train_dataframe)
#     print(train_chosen)
    
#     keys_train, path_train, labels_train = data_augmentation(train_dataframe, train_chosen, dir_train, True)
#     print(keys_train.shape, path_train.shape, labels_train.shape)
#     save_csv_data_audio(keys_train, path_train, labels_train, save_train, csv_train)

#     print(test_dataframe)
#     print(test_chosen)
#     keys_test, path_test, labels_test = data_augmentation(test_dataframe, test_chosen, dir_test, False)
#     print(keys_test.shape, path_test.shape, labels_test.shape)
#     save_csv_data_audio(keys_test, path_test, labels_test, save_test, csv_test)

#     return keys_train, labels_train, keys_test, labels_test


def process_data_aug_v2(train_df, test_df, dir_train, dir_test, save_train, save_test, csv_train, csv_test):
    train_dataframe, train_chosen = filter_DF(train_df)
    test_dataframe, test_chosen = filter_DF(test_df)

    print(train_dataframe)
    print(train_chosen)
    
    keys_train, data_train, labels_train = data_augmentation(train_dataframe, train_chosen, dir_train, True)
    print(keys_train.shape, data_train.shape, labels_train.shape)
    # save_csv_data_audio(keys_train, path_train, labels_train, save_train, csv_train)

    print(test_dataframe)
    print(test_chosen)
    keys_test, data_test, labels_test = data_augmentation(test_dataframe, test_chosen, dir_test, False)
    print(keys_test.shape, data_test.shape, labels_test.shape)
    # save_csv_data_audio(keys_test, path_test, labels_test, save_test, csv_test)

    return data_train, labels_train, data_test, labels_test


def extract_audio_to_image(file_csv, folder, savedir):
    if not os.path.isdir(savedir):
        os.mkdir(savedir)

    uuid = np.array(pd.read_csv(file_csv, usecols=['uuid'])).T[0]
    labels = np.array(pd.read_csv(file_csv, usecols=['labels'])).T[0]
    audio_file =  np.array(pd.read_csv(file_csv, usecols=['path'])).T[0]

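    for uid, name in zip(uuid, audio_file):
        y, sr = librosa.load(folder + name)
        # get_melspec returns the spectrogram values (not a file path), so print a summary only
        melspec = get_melspec(savedir, y, sr, uid)
        print(uid, np.array(melspec).shape)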

In [None]:
# aug_train = '/content/drive/MyDrive/virufy_data/audio_train/'
# aug_test = '/content/drive/MyDrive/virufy_data/audio_test/'
# csv_audio_train = '/content/drive/MyDrive/virufy_data/metadata_audio_train.csv'
# csv_audio_test = '/content/drive/MyDrive/virufy_data/metadata_audio_test.csv'

version = 'v1MFCCsF13'

image_train = '/content/drive/MyDrive/virufy_data/image_train_'+ version
image_test = '/content/drive/MyDrive/virufy_data/image_test_'+ version

save_csv = '/content/drive/MyDrive/virufy_data/'

meta_train = 'metadata_train_aug_'+ version +'.csv'
meta_test = 'metadata_test_aug_'+ version +'.csv'

train_data, train_labels, test_data, test_labels = process_data_aug_v2(cdf_train, cdf_test, image_train, image_test, save_csv, save_csv, meta_train, meta_test)

save_dump('/content/drive/MyDrive/virufy_data/data_feature_aug_'+ version +'.data', train_data, train_labels)
save_dump('/content/drive/MyDrive/virufy_data/data_feature_test_aug_'+ version +'.data', test_data, test_labels)
view_chart([sum(train_labels == 'positive'), sum(train_labels == 'negative')], ['positive', 'negative'], 'Chart Data Train')
view_chart([sum(test_labels == 'positive'), sum(test_labels == 'negative')], ['positive', 'negative'], 'Chart Data Test')


In [None]:
import librosa.display
import matplotlib.pyplot as plt
import numpy as np
import cv2


def show_image(_image):
    plt.rcParams["figure.figsize"] = (2.24, 2.24)
    fig, ax = plt.subplots()
    img = librosa.display.specshow(_image, ax=ax)
    plt.style.use('classic')
    fig.colorbar(img, ax=ax)
    ax.set(title='IMG')
    plt.show()
    pass


def save_image(_mfc, _path):
    plt.rcParams["figure.figsize"] = (2.24, 2.24)
    fig, ax = plt.subplots()
    librosa.display.specshow(_mfc, ax=ax)
    plt.savefig(_path)
    plt.cla()
    plt.clf()
    plt.close('all')
    pass


def lib_mfc_mean_matrix(_y, _sr):
    # Use the function arguments rather than the global y and sr
    mel = librosa.feature.melspectrogram(y=_y, sr=_sr, n_mels=256, fmax=16000)
    mfc = librosa.feature.mfcc(S=librosa.power_to_db(mel, ref=np.max), n_mfcc=2000)
    # Subtract the per-frame mean across coefficients
    mfc -= np.mean(mfc, axis=0) + 1e-8
    return mfc

from keras.applications.mobilenet_v2 import preprocess_input
from keras.preprocessing import image
def get_image(image_path):
    return cv2.imread(image_path)
    # return cv2.resize(cv2.imread(image_path), dsize=(480, 640))
    # return preprocess_input(image.img_to_array(image.load_img(image_path, target_size=(480, 640, 3))))

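def get_melspec(audio,sr):
    # Mel spectrogram in dB, returned as a nested list of shape (frames, n_mels).
    # Note: this redefines get_melspec from the preprocessing cell with a two-argument
    # signature; re-run that cell before calling extract() again.
    melspec  = librosa.feature.melspectrogram(y=audio,sr=sr,n_mels=128, fmax=8000)
    s_db     = librosa.power_to_db(melspec, ref=np.max)
    rawSBD = s_db.T.tolist()
    return rawSBD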


y, sr = librosa.load('/content/virufy-cdf-coughvid/virufy-cdf-coughvid/0007c6f1-5441-40e6-9aaf-a761d8f2da3b.webm', sr=22050)
y = librosa.util.fix_length(y, size=154350)
# print(y.shape)
# print(sr)

# mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128, fmax=8000)
# print(np.array(mel).shape)

img = get_melspec(y, 22050)

print(np.array(img).shape)

# print(img)
# print(img.shape)
# save_image(img, '/content/drive/MyDrive/dataset/DATA_SET_AUDIO/image_test_2.png')
# # # show_image(img)
# image_test = get_image('/content/drive/MyDrive/dataset/DATA_SET_AUDIO/test2.png')
# print(image_test.shape)

(302, 128)
