# Preprocess

## General

After consulting the resources recommended for the International Phonetic Alphabet (IPA), we selected a very well documented dataset, the Buckeye Speech Corpus (https://buckeyecorpus.osu.edu/). It is a collection of speech recordings comprising nearly 300,000 words, delivered by 40 different English speakers. For testing and scoping purposes, we'll use only 10 of them for the moment.

This dataset is also well suited to our main tasks because it comes with a detailed, time-labeled phonetic transcription, which means we can use the timestamps to cut the raw audio exactly at the phoneme boundaries.
Since the corpus comes in fragments, and with more data than is relevant in our case, we'll need to build a structured DataFrame ourselves that serves both of our goals: selecting a sound based on the phonetic symbols obtained from the user's standard-text input, and training a Transformer architecture so that it can recognize the phonetic characteristics of our own voices and separate them as expected.

Considering our needs, we'll build a DataFrame with the following table structure:

ID  | Phone (written form) | Filepath | Speaker          | MFCC      | Audio features (to be decided)
--- | -------------------- | -------- | ---------------- | --------- | ------------------------------
Int | String, Categorical  | String   | Int, Categorical | Structure | To be decided
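
As a rough sketch of that layout, the snippet below builds an empty frame with matching columns. The column names (`phone`, `filepath`, `speaker`, `mfcc`) and the dtype choices are assumptions made for illustration, and the still undecided audio features are left out.

```python
import pandas as pd

# Hypothetical column names and dtypes mirroring the table above
schema = {
    "phone": "category",    # written form of the phone
    "filepath": "string",   # location of the sliced clip
    "speaker": "category",  # speaker identifier
    "mfcc": "object",       # per-clip feature structure (e.g. a 2-D array)
}

phones_df = pd.DataFrame({name: pd.Series(dtype=dtype) for name, dtype in schema.items()})
phones_df.index.name = "id"

print(phones_df.dtypes)
```

Keeping `phone` and `speaker` categorical keeps memory usage low once many thousands of clips share the same few labels.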
     


## Utility Functions (Text2Phone Part)

Considering one speaker, we need to do the following for each of the available recordings:
- Go through the phone descriptor file, collecting each phone's start (and subsequent end) time, along with the phone label itself;
- Use this intermediate information to slice the raw speech recording into individual files;
- Create the dataset structure, which will also lead to better DataFrame modelling and organization;
- Given a phone speech fragment, compute its MFCC (and other features).
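
The last step is not implemented in this notebook yet. As a minimal sketch of what computing MFCCs for one clip involves, the NumPy-only code below frames the signal, takes the power spectrum, applies a mel filterbank and finishes with a DCT. All function names and default parameters (13 coefficients, 26 filters, 25 ms frames, 10 ms hop) are assumptions chosen for illustration, and the 16 kHz test tone is synthetic. A tested library routine such as `librosa.feature.mfcc` is likely the better choice in practice.

```python
import numpy as np

def hz_to_mel(hz):
    return 2595.0 * np.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters spaced evenly on the mel scale, shape (n_filters, n_fft // 2 + 1)."""
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    bank = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):       # rising edge of the triangle
            bank[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):      # falling edge of the triangle
            bank[m - 1, k] = (right - k) / max(right - center, 1)
    return bank

def mfcc(signal, sample_rate, n_mfcc=13, n_filters=26, frame_len=0.025, hop_len=0.010):
    """Return an array of shape (n_frames, n_mfcc) for a 1-D float signal."""
    frame_size = int(round(frame_len * sample_rate))
    hop_size = int(round(hop_len * sample_rate))
    n_fft = 1 << (frame_size - 1).bit_length()  # smallest power of two >= frame_size
    # Pad very short clips so that at least one full frame exists
    if len(signal) < frame_size:
        signal = np.pad(signal, (0, frame_size - len(signal)))
    n_frames = 1 + (len(signal) - frame_size) // hop_size
    idx = np.arange(frame_size)[None, :] + hop_size * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(frame_size)
    power = np.abs(np.fft.rfft(frames, n=n_fft)) ** 2 / n_fft
    energies = power @ mel_filterbank(n_filters, n_fft, sample_rate).T
    log_energies = np.log(energies + 1e-10)
    # DCT-II over the filter axis, keeping the first n_mfcc coefficients
    n = np.arange(n_filters)
    basis = np.cos(np.pi * np.arange(n_mfcc)[:, None] * (2 * n[None, :] + 1) / (2 * n_filters))
    return log_energies @ basis.T

# Synthetic 0.16 s tone at an assumed 16 kHz sample rate, roughly the length of one phone
sample_rate = 16000
t = np.arange(2560) / sample_rate
feats = mfcc(np.sin(2 * np.pi * 440.0 * t), sample_rate)
print(feats.shape)
```

Each clip yields one row of coefficients per 10 ms hop, so clips of different lengths produce matrices with different numbers of rows. That is why the MFCC column of the planned table is described as a structure rather than a scalar.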

### Read and create/extend phonetic dataframe given a path

In [183]:
import pandas as pd
import os

In [184]:
# Root folder of the raw Buckeye files (the annotation symbols to ignore are listed in DISPOSABLE_PHONES further down)
RAW_PATH = os.path.join("..", "Dataset", "Buckeye", "Raw")

In [185]:
def build_one_file_df(path) -> pd.DataFrame:
    """Given a single .phones filepath, map all of its information into an intermediary DataFrame."""

    df = pd.DataFrame(columns=["phone","recording_label", "start_time", "duration"])

    with open(path, 'r') as f:
        lines = f.readlines()

    # The second token of the first line names the recording.
    # str.rstrip('.sd') would strip any trailing '.', 's' or 'd' characters, so cut the suffix explicitly.
    recording_label = lines[0].split()[1]
    if recording_label.endswith('.sd'):
        recording_label = recording_label[:-len('.sd')]

    lines = lines[9:-1] # where the records start and end, always

    for j,l in enumerate(lines, start=1):

        start_time, _ ,phone = l.strip().split()
        start_time = float(start_time)

        if j == 1:
            df.loc[j] = [phone, recording_label, start_time, 0]
            continue

        # the start of the current entry closes the previous phone, which sets its duration
        df.loc[df.index[-1], 'duration'] = start_time - df.loc[df.index[-1], 'start_time']

        if j == len(lines)-1:
            # this entry only provides the end time of the previous phone and is not stored
            break

        df.loc[df.index.max() + 1] = [phone, recording_label, start_time, 0]

    return df


In [186]:
# Function simple test
test_raw_path = os.path.join(RAW_PATH,"s01","s0101a.phones")
test_df = build_one_file_df(path=test_raw_path)
test_df.head(5)

Index | phone     | recording_label | start_time | duration
----- | --------- | --------------- | ---------- | ---------
1     | {B_TRANS} | s0101           | 0.102385   | 4.173359
2     | SIL       | s0101           | 4.275744   | 4.238019
3     | NOISE     | s0101           | 8.513763   | 23.702812
4     | IVER      | s0101           | 32.216575  | 0.160018
5     | k         | s0101           | 32.376593  | 0.245452
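
Each duration in this output is simply the gap between two consecutive start times. A vectorised sketch of the same computation is shown here, using the start times copied from the sample output; it is an alternative illustration, not the function used above, and the name `sketch_df` is hypothetical.

```python
import pandas as pd

# Start times copied from the sample output; the last row only marks where "IVER" ends
rows = [
    ("{B_TRANS}", 0.102385),
    ("SIL", 4.275744),
    ("NOISE", 8.513763),
    ("IVER", 32.216575),
    ("k", 32.376593),
]
sketch_df = pd.DataFrame(rows, columns=["phone", "start_time"])

# Duration of a phone = start of the next entry minus its own start
sketch_df["duration"] = sketch_df["start_time"].shift(-1) - sketch_df["start_time"]

# The final entry has no successor, so its duration is unknown (NaN) and it is dropped
sketch_df = sketch_df.dropna(subset=["duration"]).reset_index(drop=True)
print(sketch_df)
```

Computing the whole column at once avoids appending rows one by one, which becomes slow on long recordings.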


### Cleaning disposable phones

In [187]:
# Some sounds or recording issues are flagged with specific symbols, which we should ignore
DISPOSABLE_PHONES = ['{B_TRANS}', 'SIL', 'NOISE', 'IVER', 'VOCNOISE', '<EXCLUDE-name>','LAUGH']

In [188]:
def clean_df_phones(df:pd.DataFrame):
    # Keep only the rows whose phone is not one of the disposable symbols
    return df[~df["phone"].isin(DISPOSABLE_PHONES)]

In [189]:
# Test
test_df = clean_df_phones(test_df)

### Creating basic structural functions (filetrees)

In [190]:
DESTINATION_BASE_PATH = os.path.join("..", "Dataset","Buckeye", "Processed")

In [191]:
def create_directories_if_none(phones):
    # One folder per distinct phone; makedirs also creates the base folder when it is missing
    for p in set(phones):
        folder_path = os.path.join(DESTINATION_BASE_PATH, p)
        os.makedirs(folder_path, exist_ok=True)

In [192]:
# Test
create_directories_if_none(test_df["phone"])

### Cutting sound in phonetic pieces given individual audio source

In [193]:
from pydub import AudioSegment
from typing import List,Tuple
import random
import time

In [194]:
def apply_phonetic_slicing(df:pd.DataFrame,audio_raw_path:str) -> List[str]:
    """
    Given one dataframe and a raw audio file, slice the audio into multiple smaller segments.
    Their names follow the pattern <PHONE>_<SPEAKER-NUM>_<RECORDING_NUM>_<HASHKEY>, and each one
    is saved in its matching phone folder.
    They need a unique identifier (a single recording can have multiple entries of the same
    phoneme), so, to keep things simple, the key is a timestamp minus a random number. Technically
    this does not rule out collisions, but for now it's more than enough.
    The list of saved locations is returned so it can be added to the main DataFrame.

    TODO: check that all recordings have the same bit rate and sample rate.
    """

    # Every row comes from the same recording, so the first row is enough
    recording_label = df.iat[0,df.columns.get_loc("recording_label")]

    # TODO: also select the audio file based on the recording label
    audio = AudioSegment.from_file(audio_raw_path, format="wav")

    path_names = []

    for phone, start_time, duration in zip(df["phone"].values, df["start_time"].values, df["duration"].values):

        # pydub slices in milliseconds, while the dataframe stores seconds
        start_ms = start_time * 1000
        end_ms = (start_time + duration) * 1000
        segment = audio[start_ms:end_ms]

        phone_destination_folder = os.path.join(DESTINATION_BASE_PATH, phone)
        clip_key = f"{time.time() - random.uniform(0, 1):.6f}"
        audio_clip_path = os.path.join(phone_destination_folder, f"{phone}_{recording_label}_{clip_key}.wav")

        # Store the full path, extension included, so it points at the exported file
        segment.export(audio_clip_path, format="wav")
        path_names.append(audio_clip_path)

    return path_names
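
The docstring notes that the timestamp-based key does not strictly rule out name collisions. One possible alternative, sketched here with a hypothetical helper `make_clip_name`, is to use a random UUID from the standard library as the key. This is a suggestion, not the scheme used by the function above.

```python
import uuid

def make_clip_name(phone: str, recording_label: str) -> str:
    """Hypothetical helper: build '<PHONE>_<RECORDING>_<KEY>' with a random 128-bit key."""
    return f"{phone}_{recording_label}_{uuid.uuid4().hex}"

# Generating many names for the same phone and recording should yield no duplicates
names = {make_clip_name("k", "s0101a") for _ in range(10_000)}
print(len(names))
```

A deterministic key, such as the row position inside the recording, would work as well and has the advantage of producing the same file names on every run.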

In [195]:
# Test
test_paths = apply_phonetic_slicing(test_df,os.path.join("..","Dataset","Buckeye","Raw","s01","s0101a.wav"))
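
The slicing docstring leaves a reminder to check that all recordings share the same bit rate and sample rate. A minimal sketch of such a check with the standard-library `wave` module follows. The helper names are hypothetical, and the two short silent files are generated only so the example runs without the corpus.

```python
import os
import tempfile
import wave

def wav_params(path):
    """Return (channels, sample_width_in_bytes, frame_rate) for a WAV file."""
    with wave.open(path, "rb") as w:
        return w.getnchannels(), w.getsampwidth(), w.getframerate()

def find_mismatched(paths):
    """Return the files whose parameters differ from those of the first file."""
    if not paths:
        return []
    reference = wav_params(paths[0])
    return [p for p in paths if wav_params(p) != reference]

def write_silence(path, frame_rate, n_frames=160):
    """Write a short mono, 16-bit silent WAV file (test fixture only)."""
    with wave.open(path, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)
        w.setframerate(frame_rate)
        w.writeframes(b"\x00\x00" * n_frames)

with tempfile.TemporaryDirectory() as tmp:
    a = os.path.join(tmp, "a.wav")
    b = os.path.join(tmp, "b.wav")
    write_silence(a, 16000)
    write_silence(b, 8000)  # deliberately different sample rate
    mismatched_names = [os.path.basename(p) for p in find_mismatched([a, b])]

print(mismatched_names)
```

Running `find_mismatched` over every raw `.wav` file before slicing would flag recordings that need resampling first.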

## Text2Phone 

Now we'll create the scripts that actually build the main database for the program. As output, we expect a pandas DataFrame containing audio clip locations grouped by phone. The clips themselves will also be separated into folders, for better organization and ease of modification.
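
To illustrate how such a table could later serve the text-to-phone goal, the sketch below groups clip paths by phone and picks one clip per requested phone. The rows, paths and the `pick_clips` helper are hypothetical; real paths would come from the slicing step.

```python
import random
import pandas as pd

# Hypothetical rows standing in for the real clip table
clips = pd.DataFrame({
    "phone": ["k", "k", "ae", "t"],
    "audio_path": ["k/k_1.wav", "k/k_2.wav", "ae/ae_1.wav", "t/t_1.wav"],
})

# Map each phone to the list of clips that realise it
clips_by_phone = clips.groupby("phone")["audio_path"].apply(list).to_dict()

def pick_clips(phone_sequence, rng=random):
    """Pick one clip per requested phone; raises KeyError for phones without clips."""
    return [rng.choice(clips_by_phone[p]) for p in phone_sequence]

selection = pick_clips(["k", "ae", "t"], rng=random.Random(0))
print(selection)
```

Passing a seeded `random.Random` makes the selection reproducible, which helps when comparing synthesized outputs.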

### Main Dataframe

In [None]:
prod_df = pd.DataFrame(columns=["phone","speaker","recording_label","audio_path"])

### Collecting all phone and audio sources

A supporting dictionary, mapping each speaker to its lists of phone and audio files, is created here to make the later iterations easier.

In [None]:
SPEAKERS = [f"s{i:02d}" for i in range(1,11)]  # s01 ... s10

# Maps each speaker to a ([.phones files], [.wav files]) tuple
PHONE_AUDIO_REFS = {}
for s in SPEAKERS:
    raw_speaker_path = os.path.join(RAW_PATH, s)
    wav_files, phone_files = [],[]

    # os.listdir returns names in arbitrary order, so sort them to get a stable order
    for filename in sorted(os.listdir(raw_speaker_path)):
        if filename.endswith(".wav"):
            wav_files.append(filename)
        elif filename.endswith(".phones"):
            phone_files.append(filename)

    PHONE_AUDIO_REFS[s] = (phone_files, wav_files)
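
Since every transcription has to be matched with its recording later on, pairing the files by their shared stem may be convenient. The sketch below does this on a made-up directory listing; `pair_by_stem` is a hypothetical helper, and the file names only imitate the corpus naming.

```python
import os

def pair_by_stem(filenames):
    """Return {stem: (phones_file, wav_file)} for stems that have both files."""
    phones, wavs = {}, {}
    for name in filenames:
        stem, ext = os.path.splitext(name)
        if ext == ".phones":
            phones[stem] = name
        elif ext == ".wav":
            wavs[stem] = name
    # Keep only the stems present on both sides, in sorted order
    return {stem: (phones[stem], wavs[stem]) for stem in sorted(phones.keys() & wavs.keys())}

# Made-up listing: s0101b has no audio, so it is left out of the result
listing = ["s0101a.wav", "s0101a.phones", "s0101a.words",
           "s0101b.phones", "s0102a.wav", "s0102a.phones"]
pairs = pair_by_stem(listing)
print(pairs)
```

Pairing by stem avoids relying on both lists having the same length and order.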

### Slicing everything and adding data to Dataframe

In [None]:
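frames = []
for speaker in SPEAKERS:
    phone_files, wav_files = PHONE_AUDIO_REFS[speaker]
    for phone_file in phone_files:
        wav_file = phone_file.replace(".phones", ".wav")
        if wav_file not in wav_files:
            continue  # no audio for this transcription
        # Build and clean the table for this recording, then cut the audio accordingly
        df = clean_df_phones(build_one_file_df(os.path.join(RAW_PATH, speaker, phone_file)))
        if df.empty:
            continue
        create_directories_if_none(df["phone"])
        paths = apply_phonetic_slicing(df, os.path.join(RAW_PATH, speaker, wav_file))
        # Add the new clip paths and the speaker, keeping only the main DataFrame columns
        frames.append(df.assign(speaker=speaker, audio_path=paths)[list(prod_df.columns)])
if frames:
    prod_df = pd.concat(frames, ignore_index=True)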