### Data processing

The first thing to do is to import all the necessary libraries:

In [None]:
import pandas as pd
import numpy as np

from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.ensemble import RandomForestClassifier

import keras
from keras.utils import to_categorical
from keras.models import Sequential, Model
from keras.layers import Input, Dense, Dropout, Flatten
from keras.layers import Conv2D, MaxPooling2D
from keras.layers import Conv1D, MaxPooling1D
# BatchNormalization and LeakyReLU are imported from keras.layers directly;
# the keras.layers.normalization / advanced_activations paths fail on newer Keras releases
from keras.layers import BatchNormalization, LeakyReLU

Now we can load the data and preprocess it. Before we start, let's look at the data in more detail.
You can find the data here: https://surfdrive.surf.nl/files/index.php/s/A91xgk7B5kXNvfJ

- **feat.npy** is an array with Mel-frequency cepstral coefficients extracted from each wav file. The features at index *i* in this array were extracted from the wav file at index *i* of the array in the file path.npy.
- **path.npy** is an array with the order of wav files in the feat.npy array.
- **train.csv** contains two columns: path, with the filename of the recording, and word, with the word that was pronounced in the recording. This is the training portion of the data.
- **test.csv** is the testing portion of the data, and it has the same format as the file train.csv except that the column word is absent.

In [None]:
# --------------------- LOAD DATA ------------------------ #

features = np.load("feat.npy", allow_pickle = True)
path = np.load("path.npy", allow_pickle = True)
train = pd.read_csv("train.csv", delimiter = ",")
test = pd.read_csv("test.csv", delimiter = ",")

To put it simply, path.npy holds the wav file name of each recording and feat.npy holds its features.

In [None]:
print("The file with name ",path[0], "has this set of features:")
print("----------------------------------------------------------")
print(features[0])

As you can see, a file name and its features share the same index in path and features. The two arrays must stay aligned; otherwise we pair file names with the wrong features and, consequently, get very poor accuracy.
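A quick sanity check can catch a misalignment early. The following is a minimal sketch on made-up toy arrays (the file names and values are hypothetical, not taken from the dataset), and `check_alignment` is a helper introduced here purely for illustration.

```python
import numpy as np

def check_alignment(path, features):
    """Return True if path and features can be paired index by index."""
    # Both arrays must have exactly one entry per recording
    if len(path) != len(features):
        return False
    # Duplicate file names would make the pairing ambiguous
    return len(set(path)) == len(path)

# Hypothetical toy data: three recordings with 2, 3 and 1 frames of 13 coefficients
toy_path = np.array(["a.wav", "b.wav", "c.wav"])
toy_features = np.empty(3, dtype=object)
for i, n_frames in enumerate([2, 3, 1]):
    toy_features[i] = np.zeros((n_frames, 13))

print(check_alignment(toy_path, toy_features))      # same length, unique names
print(check_alignment(toy_path[:2], toy_features))  # one file name is missing
```

The check only compares lengths and uniqueness; it cannot tell whether the order itself is right, so printing a few pairs as above is still worthwhile.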

Ok, now stay with me, because this is the most boring part of the project. Unfortunately, the data are not set up to work with directly, so we cannot apply ML or DL algorithms (yet). But this is fine: we are Data Scientists, and we know that we have to get "our hands dirty" before having fun ;).

The problem is that the split into train set and test set is already given (see train.csv and test.csv). I know, now you are thinking: "Why in the world is this a problem? We need those!". That is true, but only the file names have been split, not their features.

In [None]:
print(test.head())
print()
print(train.head())

In conclusion, we need a way to split our data, which is stored in path.npy and feat.npy, according to the file names listed in train.csv and test.csv.

To do that, we first link path.npy and feat.npy with a dictionary: key = path and value = feat. You will understand why in a moment.

In [None]:
# create dictionary: key = path and value = feat
dic = dict(zip(path, features))

After creating the dictionary, we need a function that splits all our data (now in the dictionary) into train and test sets according to the csv files. Basically, we go through each data frame (train.csv and test.csv) row by row and look up each file name in the dictionary. This gives one list per data frame, in which position *i* holds the features of the recording in row *i* of that data frame.

In [None]:
# this function takes a pandas data frame and a dictionary as arguments
# and creates a new list that follows the row order of the data frame

def create_list(data_frame, dic):
    new_list = []
    for name in data_frame["path"]:
        if name in dic:  # note: names missing from dic are skipped
            new_list.append(dic[name])  # at position i we add the features of row i
    return new_list

# to convert a list into a NumPy array, every example must have the same shape,
# so we pad shorter examples with zero rows up to 99 frames of 13 coefficients
def padding(data):
    zeros_list = [0] * 13
    for example in range(len(data)):
        if data[example].shape[0] != 99:
            to_change = data[example].tolist()
            for adding in range(99 - len(to_change)):
                to_change.append(zeros_list)
            data[example] = np.array(to_change)
    return data
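
The same padding can be written more compactly with `np.pad`. This is only a sketch, not the version used in the rest of the notebook: `pad_to_length` and its `target_len` parameter are names introduced here, and truncating recordings longer than the target is an added assumption (the loop above leaves such recordings at their original length).

```python
import numpy as np

def pad_to_length(example, target_len=99):
    """Zero-pad (or truncate) a (frames, coefficients) array to target_len frames."""
    example = np.asarray(example)[:target_len]  # assumption: cut overly long recordings
    missing = target_len - example.shape[0]
    # Pad only at the end of the time axis; leave the coefficient axis untouched
    return np.pad(example, ((0, missing), (0, 0)), mode="constant")

# Toy examples with 13 coefficients per frame and different numbers of frames
toy = [np.ones((5, 13)), np.ones((99, 13)), np.ones((120, 13))]
batch = np.stack([pad_to_length(x) for x in toy])
print(batch.shape)
```

Because every example leaves the function with the same shape, `np.stack` can build the 3-D array directly.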
    

Now we can apply the functions defined above and check the resulting shapes, to get a feeling for the kind of data we are going to deal with.

In [None]:
# split test and train 
training_data = create_list(train,dic)
test_data = create_list(test,dic)

# padding
training_data = padding(training_data)
test_data = padding(test_data)

# convert to array
training_data = np.array(training_data)
test_data = np.array(test_data)

#check shape
training_data.shape,test_data.shape

We did it: we now have ordered training and test data, composed of 94824 and 11005 examples respectively, each with 99 lists of 13 elements. From now on, the features in the variables training_data and test_data are linked to train.csv and test.csv by position: row *i* of each array corresponds to row *i* of the matching csv file.
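One caveat: `create_list` skips any file name that is not in the dictionary, so a single missing name would silently shift every following row relative to the csv file. The sketch below, on hypothetical toy data, shows a stricter lookup that raises an error instead; `lookup_features` is a name introduced here for illustration.

```python
import numpy as np
import pandas as pd

def lookup_features(data_frame, dic):
    """Return one feature array per row of data_frame, in row order."""
    missing = [name for name in data_frame["path"] if name not in dic]
    if missing:
        # Fail loudly rather than return a list shorter than the data frame
        raise KeyError(f"{len(missing)} file(s) have no features, e.g. {missing[0]}")
    return [dic[name] for name in data_frame["path"]]

# Hypothetical toy data: two recordings, already padded to 99 frames
toy_dic = {"a.wav": np.zeros((99, 13)), "b.wav": np.ones((99, 13))}
toy_train = pd.DataFrame({"path": ["b.wav", "a.wav"], "word": ["yes", "no"]})

toy_X = np.array(lookup_features(toy_train, toy_dic))
# The first axis must match the number of rows in the data frame
print(toy_X.shape == (len(toy_train), 99, 13))
```

On the real data, comparing `len(training_data)` with `len(train)` (and likewise for the test set) gives the same reassurance without changing any code.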