# Preprocess

## General

After consulting the recommended International Phonetic Alphabet (IPA) resources, we selected a very well-documented dataset, the Buckeye Speech Corpus (https://buckeyecorpus.osu.edu/). It is a collection of speech recordings comprising nearly 300,000 words spoken by 40 different English speakers. For testing and scoping purposes, we are going to use only 10 of them for the moment.

This dataset is also well suited to our main tasks because it comes with a detailed, time-labeled phonetic transcription, which means we can use the timestamps to cut the raw audio exactly at the phoneme boundaries.
Since the corpus comes in fragments, and with more data than is relevant in our case, we'll need to build a structured dataframe ourselves that serves both of our goals: selecting a sound based on the phonetic symbols obtained from the user's standard-text input, and training a Transformer architecture to recognize the phonetic characteristics of our own voices and separate them as expected.

Considering our needs, we'll build a dataframe with the following table structure:

ID  | Phone (written form) | Filepath | Speaker          | MFCC      | Audio Features (to be decided)
--- | -------------------- | -------- | ---------------- | --------- | ------------------------------
Int | String, Categorical  | String   | Int, Categorical | Structure | To be decided
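
As a minimal sketch, this schema might be declared in pandas as follows. The column names and dtypes are assumptions made for illustration (the notebook has not fixed them yet), and the undecided audio features are left out.

```python
import pandas as pd

# Hypothetical column names and dtypes mirroring the table above
PHONE_COLUMNS = {
    "id": "int64",
    "phone": "category",     # written form of the phone
    "filepath": "string",    # where the sliced audio file is stored
    "speaker": "category",   # speaker identifier
    "mfcc": "object",        # one array of coefficients per row
}


def empty_phone_dataframe() -> pd.DataFrame:
    """Build an empty dataframe whose columns already carry the intended dtypes."""
    return pd.DataFrame(
        {name: pd.Series(dtype=dtype) for name, dtype in PHONE_COLUMNS.items()}
    )


phone_df = empty_phone_dataframe()
```

Declaring the dtypes up front keeps `phone` and `speaker` categorical even before any row is added.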
     


## Utility Functions (Text2Phone Part)

For each speaker, we need to do the following for every available recording:
- Go through the phone descriptor file, collecting each phone's start time (and, from the next entry, its end time) together with the phone label itself;
- Use this intermediate information to slice the raw speech recording into individual files;
- Create the dataset structure, which will also lead to better DataFrame modelling and organization;
- Given a phone speech fragment, compute its MFCC (and other features).
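
The last step is not implemented in this notebook yet. The sketch below shows one way the MFCC of a single fragment could be computed with plain NumPy, following the usual pipeline (framing, Hamming window, power spectrum, mel filterbank, log, DCT-II). Every parameter value (25 ms frames and 10 ms hop at an assumed 16 kHz sample rate, 26 filters, 13 coefficients) and all function names are assumptions for illustration, not choices made by this project.

```python
import numpy as np


def hz_to_mel(hz):
    return 2595.0 * np.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_filterbank(n_filters, n_fft, sample_rate):
    # Triangular filters whose edges are evenly spaced on the mel scale
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    bank = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        left, centre, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, centre):
            bank[m - 1, k] = (k - left) / max(centre - left, 1)
        for k in range(centre, right):
            bank[m - 1, k] = (right - k) / max(right - centre, 1)
    return bank


def mfcc(signal, sample_rate, n_mfcc=13, n_filters=26, n_fft=512, frame_len=400, hop=160):
    """Return an array of shape (n_frames, n_mfcc) for a mono signal."""
    signal = np.asarray(signal, dtype=float)
    if len(signal) < frame_len:
        # Very short phones are zero-padded to a single full frame
        signal = np.pad(signal, (0, frame_len - len(signal)))
    n_frames = 1 + (len(signal) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(frame_len)
    power = np.abs(np.fft.rfft(frames, n=n_fft)) ** 2 / n_fft
    mel_energy = power @ mel_filterbank(n_filters, n_fft, sample_rate).T
    log_mel = np.log(mel_energy + 1e-10)  # small offset avoids log(0)
    # DCT-II basis written out explicitly (unnormalised)
    n = np.arange(n_filters)
    basis = np.cos(np.pi * np.outer(np.arange(n_mfcc), 2 * n + 1) / (2 * n_filters))
    return log_mel @ basis.T
```

Because phones are short, many fragments will yield only a handful of frames, so the coefficients may have to be averaged or padded before they are stored in the dataframe.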

### Read and create/extend phonetic dataframe given a path

In [2]:
import pandas as pd
import os

In [3]:
# Some sounds or recording issues were signaled with specific symbols, which we should ignore
DISPOSABLE_PHONES = ['{B_TRANS}', 'SIL', 'NOISE', 'IVER', 'VOCNOISE', '<EXCLUDE-name>']

In [4]:
def collect_phones_and_timestamps(path:str) -> pd.DataFrame:
    """Read a .phones file and return one row per phone with its start time and duration."""

    df = pd.DataFrame(columns=["phone","recording_label", "start_time", "duration"])

    with open(path, 'r') as f:
        lines = f.readlines()

    # str.rstrip('.sd') strips any trailing '.', 's' or 'd' characters, not the suffix itself
    recording_label = lines[0].split()[1]
    if recording_label.endswith('.sd'):
        recording_label = recording_label[:-len('.sd')]

    lines = lines[9:len(lines)-1] # where the records start and end, always

    for index, l in enumerate(lines):

        start_time, _, phone = l.strip().split()
        start_time = float(start_time)

        if index > 0:
            # the previous phone lasts until the current one starts
            prev_duration = start_time - df.loc[df.index[-1], 'start_time']
            df.loc[df.index[-1], 'duration'] = prev_duration
            if index == len(lines)-1:
                # the last entry only closes the previous phone
                break

        # the duration is filled in when the next entry is read
        df.loc[len(df.index)] = [phone, recording_label, start_time, 0]

    return df


In [6]:
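# os.path.join avoids the invalid "\D", "\R" and "\s" escapes and uses the separator of the current OS
new = collect_phones_and_timestamps(path=os.path.join("..", "Dataset", "Raw", "s01", "s0101a.phones"))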

FileNotFoundError: [Errno 2] No such file or directory: '..\\Dataset\\Raw\\s01\\s0101a.phones'

In [None]:
# TODO: remove the rows whose phone is listed in DISPOSABLE_PHONES
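
A minimal sketch of how the disposable symbols could be dropped, assuming the dataframe has a `phone` column like the one built above. The helper name `drop_disposable_phones` is hypothetical.

```python
import pandas as pd

DISPOSABLE_PHONES = ['{B_TRANS}', 'SIL', 'NOISE', 'IVER', 'VOCNOISE', '<EXCLUDE-name>']


def drop_disposable_phones(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df without the rows flagged as silence, noise, etc."""
    keep = ~df["phone"].isin(DISPOSABLE_PHONES)
    return df[keep].reset_index(drop=True)


# Tiny made-up example, not real corpus data
example = pd.DataFrame({
    "phone": ["SIL", "ah", "NOISE", "k"],
    "start_time": [0.0, 0.5, 0.6, 0.9],
})
clean = drop_disposable_phones(example)
```

The filter should probably run only after the durations have been computed, since each duration is derived from the timestamp of the next entry, disposable or not.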

### Cutting sound in phonetic pieces given individual audio source

In [8]:
from pydub import AudioSegment
from typing import List, Tuple

DESTINATION_BASE_PATH = "../Processed"
#PHONE_DIRS = { key:[] for key in phones_list}

In [9]:
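def create_directories() -> None:
    # NOTE: PHONE_DIRS must be defined beforehand (its definition is still commented out)
    for phone in PHONE_DIRS.keys():
        # makedirs also creates DESTINATION_BASE_PATH and tolerates folders that already exist
        os.makedirs(os.path.join(DESTINATION_BASE_PATH, phone), exist_ok=True)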

In [None]:
""" 
    Given a single audio file and the dataframe holding the start time and duration of each phone,
    slice the audio into multiple smaller segments. Each segment is saved in a folder named after
    its phone, inside the "Processed" directory.
    File names follow the pattern: <PHONE>_<SPEAKER-NUM>_<RECORDING_NUM>_<INDEX>.wav
    (the index is an addition, so that repeated phones do not overwrite each other).

    TODO: check whether all recordings have the same bit rate and sample rate.
"""

def extract_segments_generator(audio, timestamps):
    # pydub slices in milliseconds, while the corpus timestamps are in seconds
    for start_time, end_time in timestamps:
        yield audio[int(start_time * 1000):int(end_time * 1000)]


def apply_phonetic_slicing(audio_file:str, df:pd.DataFrame) -> None:

    audio = AudioSegment.from_file(audio_file, format="wav")

    # Keep only the rows that belong to this recording (e.g. "s0101a")
    recording_label = os.path.splitext(os.path.basename(audio_file))[0]
    rows = df[df["recording_label"] == recording_label]

    # The dataframe stores start_time and duration, so the end time is derived
    timestamps = zip(rows["start_time"], rows["start_time"] + rows["duration"])
    phones = list(rows["phone"])

    # Assumption: labels look like "s0101a", i.e. s<SPEAKER-NUM><RECORDING_NUM>
    speaker_num, recording_num = recording_label[1:3], recording_label[3:]

    for i, segment in enumerate(extract_segments_generator(audio, timestamps)):

        phone_specific_folder = os.path.join(DESTINATION_BASE_PATH, phones[i])
        os.makedirs(phone_specific_folder, exist_ok=True)

        file_name = f"{phones[i]}_{speaker_num}_{recording_num}_{i}.wav"
        segment.export(os.path.join(phone_specific_folder, file_name), format="wav")
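
The note about bit rate and sample rate can be checked before any slicing is done. Below is a small sketch using only the standard-library `wave` module, assuming the recordings are plain PCM WAV files. The helper names `wav_format` and `distinct_formats` are hypothetical.

```python
import wave
from typing import Iterable, Set, Tuple


def wav_format(path: str) -> Tuple[int, int, int]:
    """Return (sample rate in Hz, bits per sample, number of channels) of a WAV file."""
    with wave.open(path, "rb") as wav:
        return wav.getframerate(), wav.getsampwidth() * 8, wav.getnchannels()


def distinct_formats(paths: Iterable[str]) -> Set[Tuple[int, int, int]]:
    """Collect the distinct formats found; a single element means all files agree."""
    return {wav_format(path) for path in paths}
```

If `distinct_formats` returns more than one tuple, the recordings would need to be resampled or converted before their segments can be compared or fed to the same model.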


## Text2Phone

### Iterating over all audio sources to get phonetics

In [11]:
SPEAKERS = [f"s{i:02d}" for i in range(1, 11)]  # "s01" ... "s10"
# Check the number of iterations


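
One possible way to iterate over all the sources is sketched below. It assumes the layout `<raw_dir>/<speaker>/<recording>.phones`, which is what the path used earlier in this notebook suggests. The function name `phones_files_by_speaker` is hypothetical.

```python
from pathlib import Path
from typing import Dict, List

SPEAKERS = [f"s{i:02d}" for i in range(1, 11)]  # "s01" ... "s10"


def phones_files_by_speaker(raw_dir: str) -> Dict[str, List[Path]]:
    """Map every speaker to the sorted list of its .phones files (empty if the folder is missing)."""
    root = Path(raw_dir)
    files = {}
    for speaker in SPEAKERS:
        speaker_dir = root / speaker
        files[speaker] = sorted(speaker_dir.glob("*.phones")) if speaker_dir.is_dir() else []
    return files
```

The number of iterations can then be checked with `sum(len(paths) for paths in files.values())` before each path is passed to `collect_phones_and_timestamps` and the resulting dataframes are concatenated.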