# Preprocess

## General

After consulting the resources recommended for the International Phonetic Alphabet (IPA), we selected a very well documented dataset, the Buckeye Speech Corpus (https://buckeyecorpus.osu.edu/). It is a collection of speech recordings containing nearly 300,000 words delivered by 40 different English-language speakers. For testing and scoping purposes, we're going to use only 10 of them for the moment.

This dataset is also well suited to our main tasks because it comes with a detailed, time-labeled phonetic transcription, which means we can use the timestamps to cut the raw audio exactly at the phoneme boundaries.
Since the data comes in fragments, along with extra information that is not relevant in our case, we'll need to build a structured dataframe ourselves that serves both of our goals: selecting a sound based on the phonetic symbols obtained from the user's standard-text input, and training a Transformer architecture so it can recognize the phonetic characteristics of our own voices and separate them as expected.

Considering our needs, we'll build a dataframe with the following table structure:

ID  | Phone (written form) | Filepath | Speaker          | MFCC      | Audio features (to be decided)
--- | -------------------- | -------- | ---------------- | --------- | ------------------------------
Int | String, Categorical  | String   | Int, Categorical | Structure | To be decided
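
As a rough sketch of that layout, the table could be initialised as an empty pandas DataFrame with explicit dtypes. The column names used here (`phone`, `filepath`, `speaker`, `mfcc`) are assumptions made for illustration, not the final schema, and the audio-feature columns are left out because they are still to be decided.

```python
import pandas as pd

# Hypothetical column names mirroring the table above; not the final schema.
schema = {
    "phone": "category",    # written form of the phone
    "filepath": "string",   # location of the sliced audio clip
    "speaker": "category",  # speaker identifier
    "mfcc": "object",       # per-clip MFCC structure (for example, an array)
}

# Build an empty frame whose columns already carry the intended dtypes.
phones_df = pd.DataFrame({name: pd.Series(dtype=dtype) for name, dtype in schema.items()})
phones_df.index.name = "ID"

print(phones_df.dtypes)
```

Declaring the categorical columns up front keeps repeated values such as phone labels compact once the rows are filled in.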
     


## Utility Functions (Text2Phone Part)

For a single speaker, we need to do the following for each of the available recordings:
- Go through the phone descriptor file, collecting each phone's start time (and, from the next entry, its end time) together with the phone label itself;
- Use this intermediate information to slice the raw speech recording into individual audio files;
- Create the dataset structure, which also leads to a better DataFrame model and organization;
- Given a phone speech fragment, we should be able to compute its MFCC (and other features).
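
The timestamp bookkeeping behind the first two steps can be sketched in isolation. This is only a minimal illustration: the helper name `to_segments` and the timestamps are made up, and it assumes, as the rest of this notebook does, that each entry carries the start time of its phone.

```python
from typing import List, Tuple

def to_segments(entries: List[Tuple[float, str]]) -> List[Tuple[str, float, float]]:
    """Turn (start_time, phone) pairs into (phone, start_time, duration) triples.

    The last entry only marks where the previous phone ends, so it yields no segment.
    """
    segments = []
    for (start, phone), (next_start, _) in zip(entries, entries[1:]):
        segments.append((phone, start, next_start - start))
    return segments

# Hypothetical timestamps in seconds, invented for this example.
example = [(0.0, "SIL"), (0.5, "dh"), (0.75, "ah"), (1.0, "SIL")]
print(to_segments(example))
# [('SIL', 0.0, 0.5), ('dh', 0.5, 0.25), ('ah', 0.75, 0.25)]
```

Each duration is simply the gap to the next entry, which is why the final entry cannot produce a segment of its own.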

### Read and create/extend phonetic dataframe given a path

In [2]:
import pandas as pd
import os

In [3]:
# Root folder of the raw Buckeye files (recordings and phone label files)
RAW_PATH = os.path.join("..", "Dataset", "Buckeye", "Raw")

In [4]:
def build_one_file_df(path) -> pd.DataFrame:
    """ Given a single phones filepath, map all the information into an intermediary DataFrame.

    NOTE: the first column of each record is treated as the start time of the phone
    (worth double-checking against the corpus documentation).
    """

    with open(path, 'r') as f:
        lines = f.readlines()

    # rstrip('.sd') would strip any trailing '.', 's' or 'd' characters,
    # so the suffix is removed explicitly instead
    recording_label = lines[0].split()[1]
    if recording_label.endswith('.sd'):
        recording_label = recording_label[:-len('.sd')]
    speaker = recording_label[:3]

    lines = lines[9:len(lines)-1] # where the records start and end, always

    # parse (start_time, phone) pairs, reporting and skipping malformed lines
    entries = []
    for j, l in enumerate(lines, start=1):
        try:
            start_time, _, phone = l.strip().split()
        except ValueError:
            print(j, "======", l)
            continue
        entries.append((float(start_time), phone))

    # the duration of a phone is the gap until the next entry starts;
    # the last entry only closes the previous phone and is not stored itself
    rows = [
        [phone, speaker, recording_label, start, next_start - start]
        for (start, phone), (next_start, _) in zip(entries, entries[1:])
    ]

    return pd.DataFrame(rows, columns=["phone", "speaker", "recording_label", "start_time", "duration"])


In [5]:
# Function simple test
""" test_raw_path = os.path.join(RAW_PATH,"s01","s0101a.phones")
test_df = build_one_file_df(path=test_raw_path)
test_df.head(5) """


### Cleaning disposable phones

In [6]:
# Some sounds or recording issues were signaled with specific symbols, which we should ignore
DISPOSABLE_PHONES = ['{B_TRANS}','{E_TRANS}', 'SIL', 'NOISE', 'IVER', 'VOCNOISE', '<EXCLUDE-name>','LAUGH','UNKNOWN','<exclude-Name>']

In [7]:
def clean_df_phones(df:pd.DataFrame) -> pd.DataFrame:
    # keep only the rows whose phone is not in the disposable list
    keep = ~df["phone"].isin(DISPOSABLE_PHONES)
    return df[keep].copy()

In [8]:
""" # Test
test_df = clean_df_phones(test_df) """


### Creating basic structural functions (filetrees)

In [9]:
DESTINATION_BASE_PATH = os.path.join("..", "Dataset","Buckeye", "Processed")

In [10]:
def create_directories_if_none(phones):
    # one folder per distinct phone; makedirs also creates missing parent folders
    for p in set(phones):
        folder_path = os.path.join(DESTINATION_BASE_PATH, p)
        os.makedirs(folder_path, exist_ok=True)

In [11]:
""" # Test
create_directories_if_none(test_df["phone"]) """


### Cutting sound in phonetic pieces given individual audio source

In [12]:
from pydub import AudioSegment
from typing import List,Tuple
import random
import time

TARGET_SAMPLE_RATE = 44100
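
Before resampling everything to `TARGET_SAMPLE_RATE`, it can help to check what format the raw files actually have. The sketch below uses only the standard-library `wave` module, which handles uncompressed PCM WAV files. The helper name `describe_wav` is hypothetical, and the demo file is synthetic silence with arbitrary parameters, not a corpus recording.

```python
import os
import tempfile
import wave

def describe_wav(path: str) -> dict:
    """Return the basic format parameters of a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        return {
            "sample_rate": wav.getframerate(),
            "sample_width_bytes": wav.getsampwidth(),
            "channels": wav.getnchannels(),
            "duration_s": wav.getnframes() / wav.getframerate(),
        }

# Write one second of 16-bit mono silence so the example runs on its own.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.wav")
with wave.open(demo_path, "wb") as wav:
    wav.setnchannels(1)
    wav.setsampwidth(2)        # 2 bytes = 16-bit samples
    wav.setframerate(16000)    # arbitrary rate chosen for the demo
    wav.writeframes(b"\x00\x00" * 16000)

info = describe_wav(demo_path)
print(info)
```

Running the same helper over every raw file and comparing the resulting dictionaries would show whether all recordings share one sample rate and bit depth.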

In [13]:
def apply_phonetic_slicing(df:pd.DataFrame,audio_raw_path:str) -> List[str]:
    """
    Given one dataframe and a raw audio file, slice the audio into multiple smaller segments.
    Their names follow the pattern <PHONE>_<SPEAKER-NUM>_<RECORDING_NUM>_<HASHKEY> (the recording
    label already carries the speaker and recording numbers), and each is saved in its matching
    phone folder.
    They need a unique identifier (because a single recording can have multiple entries of the same
    phoneme), so, to keep it simple, I've used a timestamp and a random number to reduce the chance
    of matching names (technically this doesn't rule out collisions, but for now it's more than enough).
    The list of saved locations is then returned so it can be added to the main DataFrame.

    REMEMBER TO SEE IF THEY HAVE THE SAME BIT RATE AND SAMPLE RATE
    """

    if df.empty:
        return []

    # every row of this dataframe comes from the same recording
    recording_label = df["recording_label"].iloc[0]

    audio = AudioSegment.from_file(audio_raw_path, format="wav")
    audio = audio.set_frame_rate(TARGET_SAMPLE_RATE)
    audio = audio.normalize()

    path_names = []

    for phone, start_time, duration in zip(df["phone"], df["start_time"], df["duration"]):

        # the segment cuts need to be in milliseconds
        start_ms = start_time * 1000
        end_ms = (start_time + duration) * 1000
        segment = audio[start_ms:end_ms]

        phone_destination_folder = os.path.join(DESTINATION_BASE_PATH, phone)
        audio_seg_path = f"{phone}_{recording_label}_{time.time()-random.uniform(0,1):.6f}.wav"
        audio_clip_destination = os.path.join(phone_destination_folder, audio_seg_path)
        audio_database_path_name = os.path.join(phone, audio_seg_path)

        path_names.append(audio_database_path_name)
        segment.export(audio_clip_destination, format="wav")

    return path_names
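
The naming scheme described in the docstring relies on a timestamp plus a random offset, which does not strictly rule out collisions. One possible alternative, shown here only as a sketch, is a random UUID from the standard library. The helper name `unique_clip_name` is hypothetical and is not used elsewhere in this notebook.

```python
import uuid

def unique_clip_name(phone: str, recording_label: str) -> str:
    """Build '<PHONE>_<RECORDING_LABEL>_<HEX>.wav' with a random 128-bit suffix."""
    return f"{phone}_{recording_label}_{uuid.uuid4().hex}.wav"

# Generate many names for the same phone and recording and count the distinct ones.
names = {unique_clip_name("ah", "s0101a") for _ in range(1000)}
print(len(names))
```

The trade-off is that the suffix no longer encodes when the clip was created, so any ordering information would have to live in the dataframe instead.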

In [14]:
""" # Test
test_paths = apply_phonetic_slicing(test_df,os.path.join("..","Dataset","Buckeye","Raw","s01","s0101a.wav")) """


In [15]:
""" test_paths """


### Aggregating the wav file locations into the resulting dataframe of this individual process

In [16]:
def concatenate_result_wav_path(df:pd.DataFrame, wav_paths:List[str])->pd.DataFrame:
    df['wav_path'] = wav_paths
    return df

In [17]:
""" # Test
test_df = concatenate_result_wav_path(test_df, test_paths) """


## Text2Phone 

Now we'll create the scripts that actually build the main database for the program. As output, we expect a pandas DataFrame containing audio clip locations grouped by phone; the clips themselves are also stored in one folder per phone, for better organization and ease of modification.

### Main Dataframe

In [18]:
prod_df = pd.DataFrame(columns=["phone","speaker","recording_label"])
# the other columns (start_time, duration and wav_path) are attached during the preprocessing 

### Collecting all phone and audio sources

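A supporting nested structure is built here to make iterating over the files easier: a dictionary that maps each speaker to a pair of lists, one with the `.phones` files and one with the `.wav` files.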

In [19]:
SPEAKERS = [f"s{i:02d}" for i in range(1,13)]

PHONE_AUDIO_REFS = {}
for s in SPEAKERS:
    raw_speaker_path = os.path.join(RAW_PATH, s)
    wav_files, phone_files = [],[]

    # os.listdir gives no ordering guarantee, so sort to keep each .phones file aligned with its .wav
    for filename in sorted(os.listdir(raw_speaker_path)):
        if filename.endswith(".wav"):
            wav_files.append(filename)
        elif filename.endswith(".phones"):
            phone_files.append(filename)

    PHONE_AUDIO_REFS[s] = (phone_files, wav_files)

In [20]:
print(PHONE_AUDIO_REFS)

{'s01': (['s0101a.phones', 's0101b.phones', 's0102a.phones', 's0102b.phones', 's0103a.phones'], ['s0101a.wav', 's0101b.wav', 's0102a.wav', 's0102b.wav', 's0103a.wav']), 's02': (['s0201a.phones', 's0201b.phones', 's0202a.phones', 's0202b.phones', 's0203a.phones', 's0203b.phones', 's0204a.phones', 's0204b.phones', 's0205a.phones', 's0205b.phones', 's0206a.phones'], ['s0201a.wav', 's0201b.wav', 's0202a.wav', 's0202b.wav', 's0203a.wav', 's0203b.wav', 's0204a.wav', 's0204b.wav', 's0205a.wav', 's0205b.wav', 's0206a.wav']), 's03': (['s0301a.phones', 's0301b.phones', 's0302a.phones', 's0302b.phones', 's0303a.phones', 's0303b.phones', 's0304a.phones', 's0304b.phones', 's0305a.phones', 's0305b.phones', 's0306a.phones'], ['s0301a.wav', 's0301b.wav', 's0302a.wav', 's0302b.wav', 's0303a.wav', 's0303b.wav', 's0304a.wav', 's0304b.wav', 's0305a.wav', 's0305b.wav', 's0306a.wav']), 's04': (['s0401a.phones', 's0401b.phones', 's0402a.phones', 's0402b.phones', 's0403a.phones', 's0403b.phones', 's0404a.phon
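
Because the two lists are later matched by position, a missing or extra file would silently shift the pairs. A more defensive variant, sketched below, pairs the files by their shared stem and keeps only complete pairs. The helper name `pair_by_stem` and the directory listing are made up for illustration.

```python
import os
from typing import Dict, Iterable, List, Tuple

def pair_by_stem(filenames: Iterable[str]) -> List[Tuple[str, str]]:
    """Pair each '<stem>.phones' file with '<stem>.wav', keeping only complete pairs."""
    by_stem: Dict[str, Dict[str, str]] = {}
    for name in filenames:
        stem, ext = os.path.splitext(name)
        if ext in (".phones", ".wav"):
            by_stem.setdefault(stem, {})[ext] = name
    return [
        (found[".phones"], found[".wav"])
        for stem, found in sorted(by_stem.items())
        if len(found) == 2
    ]

# Hypothetical, unordered directory listing with one incomplete pair.
listing = ["s0101b.wav", "s0101a.phones", "s0101a.wav",
           "s0101b.phones", "s0102a.phones", "s0101a.words"]
print(pair_by_stem(listing))
# [('s0101a.phones', 's0101a.wav'), ('s0101b.phones', 's0101b.wav')]
```

Here `s0102a.phones` is dropped because no matching `.wav` appears in the listing, and the `.words` file is ignored.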

### Script for processing all of the files

In [21]:
for speaker,(phone_files, wav_files) in PHONE_AUDIO_REFS.items():
    # the two lists are matched by position: phone_files[i] describes wav_files[i]
    for phone_f, wav_f in zip(phone_files, wav_files):
        print(phone_f)

        phone_whole_path = os.path.join(RAW_PATH, speaker, phone_f)
        wav_whole_path = os.path.join(RAW_PATH, speaker, wav_f)

        local_df = build_one_file_df(phone_whole_path)
        local_df = clean_df_phones(local_df)
        create_directories_if_none(local_df["phone"])
        wav_clips_paths = apply_phonetic_slicing(local_df,wav_whole_path)
        local_final_df = concatenate_result_wav_path(local_df, wav_clips_paths)

        prod_df = pd.concat([prod_df, local_final_df], ignore_index=True)
        

s0101a.phones


s0101b.phones
s0102a.phones
s0102b.phones
s0103a.phones
s0201a.phones
s0201b.phones
s0202a.phones
s0202b.phones
s0203a.phones
s0203b.phones
s0204a.phones
s0204b.phones
s0205a.phones
s0205b.phones
s0206a.phones
s0301a.phones
s0301b.phones
s0302a.phones
s0302b.phones
s0303a.phones
s0303b.phones
s0304a.phones
s0304b.phones
s0305a.phones
s0305b.phones
s0306a.phones
s0401a.phones
s0401b.phones
s0402a.phones
s0402b.phones
s0403a.phones
s0403b.phones
s0404a.phones
s0501a.phones
s0501b.phones
s0502a.phones
s0502b.phones
s0503a.phones
s0503b.phones

s0504a.phones
s0601a.phones
s0601b.phones
s0602a.phones
s0602b.phones
s0603a.phones
s0701a.phones
s0701b.phones
s0702a.phones
s0702b.phones
s0703a.phones
s0703b.phones
s0704a.phones
s0801a.phones
s0801b.phones
s0802a.phones
s0802b.phones
s0803a.phones
s0803b.phones
s0804a.phones
s0901a.phones
s0901b.phones
s0902a.phones
s0902b.phones
s0903a.phones
s0903b.phones
s1001a.phones
s1001b.phones
s1002a.phones

s1002b.phones
s1003a.phones
s1003b.phones
s100

### Sort by phone

Since the idea of the application is to retrieve random samples of a given phone, it is convenient to keep the dataset sorted by phone.

In [22]:
prod_df = prod_df.sort_values(by=["phone", "speaker", "recording_label"]).reset_index(drop=True)

In [23]:
prod_df.to_csv("production_data.csv",sep=",")
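
Retrieving random samples of a given phone from this table could then look roughly like the sketch below. The helper name `sample_phone` is hypothetical, and the toy rows stand in for the real `prod_df`; only the `phone`, `speaker`, and `wav_path` columns are assumed.

```python
import pandas as pd

def sample_phone(df: pd.DataFrame, phone: str, n: int = 1, seed=None) -> pd.DataFrame:
    """Return up to n random rows whose 'phone' column equals the requested phone."""
    matches = df[df["phone"] == phone]
    return matches.sample(n=min(n, len(matches)), random_state=seed)

# Toy rows standing in for the real table; the paths are invented.
toy_df = pd.DataFrame({
    "phone": ["aa", "aa", "ah", "ah", "ah"],
    "speaker": ["s01", "s02", "s01", "s01", "s02"],
    "wav_path": ["aa/clip_1.wav", "aa/clip_2.wav",
                 "ah/clip_3.wav", "ah/clip_4.wav", "ah/clip_5.wav"],
})

picked = sample_phone(toy_df, "ah", n=2, seed=0)
print(picked)
```

Passing a `seed` makes the selection reproducible, which is handy while testing; leaving it as `None` gives a different draw on each call.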