# Dataset: PhonemeSpectra

http://www.timeseriesclassification.com/description.php?Dataset=PhonemeSpectra

### Info from data source:
Phoneme Description:
This dataset is a multivariate representation of a subset of the data used in the paper Dual-domain Hierarchical Classification of Phonetic Time Series.
For the raw data, each series was extracted from segmented audio collected from Google Translate.
The audio files from Google Translate are recorded at a sampling rate of 22050 Hz.
The speakers are male and female.
After data collection, the authors segmented the waveforms of the words into phonemes using the Forced Aligner tool from the Penn Phonetics Laboratory.
A spectrogram of each instance was then created with a window size of 0.001 seconds and an overlap of 90%.
Each instance in this multivariate dataset is arranged such that each dimension is a frequency band from the spectrogram.
The data consists of 39 classes each with 170 instances. 
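
To make the spectrogram step concrete, here is a minimal NumPy sketch of a short-time magnitude spectrum using the window size and overlap quoted above. The Hann window, the random toy signal, and the function name `magnitude_spectrogram` are assumptions for illustration only; this is not the dataset authors' code, and the number of frequency bands it returns is not meant to match the 11 channels of the dataset.

```python
import numpy as np

def magnitude_spectrogram(signal, sample_rate=22050, window_sec=0.001, overlap=0.9):
    """Return a (n_frames, n_bands) array of short-time magnitude spectra."""
    win = int(round(sample_rate * window_sec))      # window length in samples
    hop = max(1, int(round(win * (1 - overlap))))   # step between window starts
    n_frames = 1 + (len(signal) - win) // hop
    # Cut the signal into overlapping frames
    frames = np.stack([signal[i * hop: i * hop + win] for i in range(n_frames)])
    # Taper each frame (Hann window is an assumption, not from the source)
    frames = frames * np.hanning(win)
    # One-sided FFT magnitude: one column per frequency band
    return np.abs(np.fft.rfft(frames, axis=1))

rng = np.random.default_rng(0)
toy_signal = rng.standard_normal(2205)  # 0.1 s of noise at 22050 Hz (toy input)
spec = magnitude_spectrogram(toy_signal)
print(spec.shape)
```

Each column of `spec` is one frequency band followed over time, which is the same arrangement the description uses for the dimensions of the dataset.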

Phoneme Reference:
Publication: Hamooni H, Mueen A. Dual-domain hierarchical classification of phonetic time series. In Data Mining (ICDM), 2014 IEEE International Conference on 2014 Dec 14 (pp. 160-169). IEEE.


### Size:
+ Training samples: 3315
+ Test samples: 3353
+ Dimension: 217 timepoints x 11 channels
+ Classes: 39


In [6]:
import numpy as np
import os
import sys
import pandas as pd

CODE = 'C:\\OneDrive - Netherlands eScience Center\\Project_mcfly\\mcfly\\mcfly'
DATA = 'C:\\OneDrive - Netherlands eScience Center\\Project_mcfly\\data\\PhonemeSpectra'
sys.path.append(CODE)

In [7]:
file_train = os.path.join(DATA, 'PhonemeSpectra_TRAIN.arff')
file_test = os.path.join(DATA, 'PhonemeSpectra_TEST.arff')

In [44]:
def load_arff(filename):
    """Load a multivariate .arff file.

    Returns an array of shape (samples, timepoints, channels) and a list of labels.
    """
    data = []
    labels = []
    in_data = False  # becomes True once the '@data' marker has been read
    with open(filename) as fp:
        for line in fp:
            if in_data and line.strip():
                # The series is quoted and followed by the label: '<series>',<label>
                series, label = line.rsplit("',", 1)
                labels.append(label.strip())  # drop the trailing newline
                # Channels are separated by a literal backslash-n inside the quotes
                data.append([[x.replace("'", "") for x in channel.split(',')]
                             for channel in series.split('\\n')])
            elif line.startswith('@data'):
                in_data = True

    # (samples, channels, timepoints) -> (samples, timepoints, channels)
    return np.swapaxes(np.array(data).astype(float), 1, 2), labels

X_train, y_train = load_arff(file_train)
X_test0, y_test0 = load_arff(file_test)
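
As a small illustration of the row layout the loader relies on, the sketch below parses a single hand-made data row. The row content and the helper name `parse_arff_row` are made up for this example; the layout (a quoted series whose channels are separated by a literal backslash-n, followed by the label) is inferred from the splitting logic in `load_arff`, not from a format specification.

```python
import numpy as np

def parse_arff_row(line):
    """Parse one data row into a (timepoints, channels) array and a label."""
    body, label = line.rsplit("',", 1)            # split the series from the label
    channels = body.lstrip("'").split("\\n")      # literal backslash-n between channels
    values = np.array([c.split(",") for c in channels], dtype=float)
    return values.T, label.strip()                # transpose to (timepoints, channels)

# Toy row: 2 channels with 3 timepoints each, label 'AA' (made-up values)
row = "'1.0,2.0,3.0\\n4.0,5.0,6.0',AA\n"
series, label = parse_arff_row(row)
print(series.shape, label)
```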

In [36]:
print("X_train.shape", X_train.shape)
print(len(y_train))

print("X_test.shape", X_test0.shape)
print(len(y_test0))

X_train.shape (3315, 217, 11)
3315
X_test.shape (3353, 217, 11)
3353


In [45]:
type(X_train[0,0,0])

numpy.float64

In [46]:
X_train[0,:,10]

array([ 0.60185 ,  0.10432 ,  0.67014 ,  0.15635 ,  0.95577 ,  2.4809  ,
        3.5833  ,  4.7018  ,  1.1286  ,  7.2648  ,  7.1282  ,  5.3625  ,
        4.6666  ,  3.4076  ,  3.4368  ,  3.1312  ,  0.6371  ,  7.2779  ,
       10.702   ,  9.528   ,  8.9655  ,  5.3169  ,  2.2338  ,  0.31894 ,
        1.8213  ,  6.7641  , 10.24    ,  9.8695  ,  7.5672  ,  2.3384  ,
        3.8596  ,  7.3618  ,  3.3751  ,  2.6142  ,  2.605   ,  5.3137  ,
        5.4795  ,  1.0734  ,  1.0891  ,  3.0922  ,  2.4679  ,  0.091312,
        2.8001  ,  6.1137  ,  4.8455  ,  0.27992 ,  3.3654  ,  7.6773  ,
        9.0268  , 12.636   , 12.903   ,  8.7211  ,  8.656   ,  9.1178  ,
        5.2904  ,  3.632   ,  6.6237  ,  6.1359  ,  5.684   ,  5.1734  ,
        5.4562  ,  5.3652  ,  5.2969  ,  4.7929  ,  8.4382  ,  9.1113  ,
        2.4906  ,  1.5931  ,  1.2522  ,  5.8437  ,  8.9623  ,  5.8633  ,
        4.0618  ,  2.3871  ,  0.9758  ,  0.74115 ,  0.95252 ,  2.296   ,
        2.6277  ,  3.1806  ,  5.8372  ,  7.1867  , 

### Split test into test and validation:

In [49]:
y_val = []
y_test = []
IDs_val = []
IDs_test = []

np.random.seed(1)
# Iterate over the labels in sorted order so that the split is reproducible
# (the iteration order of a set of strings can change between Python sessions).
for label in sorted(set(y_test0)):
    idx = np.where(np.array(y_test0) == label)[0]
    # Randomly assign half of each class to validation, the rest to test
    idx1 = np.random.choice(idx, len(idx)//2, replace=False)
    idx2 = list(set(idx) - set(idx1))
    IDs_val.extend(idx1)
    IDs_test.extend(idx2)
    y_val.extend(len(idx1) * [label])
    y_test.extend(len(idx2) * [label])

    # strip() removes the trailing newline that the raw labels may carry
    print(label.strip(), y_test0.count(label))

X_test = X_test0[IDs_test,:,:]
X_val = X_test0[IDs_val,:,:]

AA 86
AE 86
AH 86
AO 86
AW 86
AY 86
B 86
CH 86
D 86
DH 86
EH 86
ER 86
EY 86
F 86
G 86
HH 86
IH 86
IY 86
JH 86
K 86
L 86
M 86
N 86
NG 86
OW 86
OY 86
P 86
R 86
S 86
SH 86
T 86
TH 86
UH 86
UW 86
V 86
W 86
Y 86
Z 86
ZH 85


In [50]:
print(X_test.shape, X_val.shape)
print(len(y_test), len(y_val))

(1677, 217, 11) (1676, 217, 11)
1677 1676
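
The per-class halving above can be checked on toy labels. The following sketch is a NumPy-only variant of the same idea, written as a hypothetical helper `split_in_half_per_class` that uses `np.random.default_rng` instead of `np.random.seed`; it is an illustration under those assumptions, not the code that produced the shapes above.

```python
import numpy as np

def split_in_half_per_class(labels, seed=1):
    """Return two disjoint index arrays, each holding about half of every class."""
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    first, second = [], []
    for label in np.unique(labels):               # sorted, so the order is stable
        idx = np.flatnonzero(labels == label)
        picked = rng.choice(idx, len(idx) // 2, replace=False)
        first.extend(picked)
        second.extend(np.setdiff1d(idx, picked))  # remaining samples of this class
    return np.array(first), np.array(second)

toy_labels = ["AA"] * 6 + ["ZH"] * 5              # made-up labels
ids_val, ids_test = split_in_half_per_class(toy_labels)
print(len(ids_val), len(ids_test))
```

With an odd class size the extra sample goes to the second half, which is why the test set above ends up one sample larger than the validation set.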


## Save pre-processed data as numpy files

In [51]:
dataset_name = 'PhonemeSpectra_'

output_path = 'C:\\OneDrive - Netherlands eScience Center\\Project_mcfly\\data\\processed'
np.save(os.path.join(output_path, dataset_name + 'X_train.npy'), X_train)
np.save(os.path.join(output_path, dataset_name + 'X_val.npy'), X_val)
np.save(os.path.join(output_path, dataset_name + 'X_test.npy'), X_test)
np.save(os.path.join(output_path, dataset_name + 'y_train.npy'), y_train)
np.save(os.path.join(output_path, dataset_name + 'y_val.npy'), y_val)
np.save(os.path.join(output_path, dataset_name + 'y_test.npy'), y_test)
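
To read the saved arrays back, `np.load` can be used. The sketch below shows the round trip on small dummy arrays in a temporary directory; the `demo_` file prefix and the dummy data are assumptions for illustration, not the files written above.

```python
import os
import tempfile
import numpy as np

with tempfile.TemporaryDirectory() as tmp:
    X_demo = np.zeros((4, 217, 11))               # dummy data with the dataset's layout
    y_demo = ["AA", "AE", "AH", "AO"]             # dummy labels

    np.save(os.path.join(tmp, 'demo_X.npy'), X_demo)
    np.save(os.path.join(tmp, 'demo_y.npy'), y_demo)   # list is stored as a string array

    X_loaded = np.load(os.path.join(tmp, 'demo_X.npy'))
    y_loaded = np.load(os.path.join(tmp, 'demo_y.npy'))

print(X_loaded.shape, y_loaded.tolist())
```

Note that label lists come back as NumPy string arrays rather than Python lists.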

## Or: create a new split of the data?