---
---

# Feature Engineering

---
---

_The claims and methods in this notebook are supported by the scientific studies referenced in the README file._

Load libraries and data

In [1]:
from sklearn.preprocessing import StandardScaler, LabelEncoder, OneHotEncoder
import sys
sys.path.append("../src")
from data_manager import load_audio_files, filter_data_based_on_accents
from load_config import load_constants_from_yaml
from custom_transformers.split_silence_transformer import SplitSilenceTransformer
from custom_transformers.mfcc_transformer import MfccTransformer
from expander_transformer import ExpanderTransformer

In [2]:
constants = load_constants_from_yaml('../constants.yml')

SAMPLING_RATING = constants["SAMPLING_RATING"]
FRAME_LENGTH_ENERGY = constants["FRAME_LENGTH_ENERGY"]
THRESHOLD_PERCENTAGE = constants["THRESHOLD_PERCENTAGE"]
MIN_SILENCE_DURATION = constants["MIN_SILENCE_DURATION"]
HOP_LENGTH = constants["HOP_LENGTH"]
SEGMENT_DURATION = constants["SEGMENT_DURATION"]
SEGMENT_OVERLAP = constants["SEGMENT_OVERLAP"]
N_MFCC = constants["N_MFCC"]
CONSIDERED_ACCENTS = constants["CONSIDERED_ACCENTS"]

In [3]:
df = load_audio_files("../data/raw/recordings/", sr=SAMPLING_RATING)
df = filter_data_based_on_accents(df=df, considered_accents=CONSIDERED_ACCENTS)

Trim silence from audio

In [4]:
split_tranformer=SplitSilenceTransformer(
    variables=['audio', 'labels'],
    sampling_rating=SAMPLING_RATING,
    threshold_percentage=THRESHOLD_PERCENTAGE,
    min_silence_duration=MIN_SILENCE_DURATION,
    frame_length_energy=FRAME_LENGTH_ENERGY,
    hop_length=HOP_LENGTH
)
split_tranformer.fit(df)

In [5]:
df=split_tranformer.transform(df)

In [6]:
df.shape

(242, 2)
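The parameters passed to `SplitSilenceTransformer` suggest an energy-based approach: frame the signal, compute the energy of each frame, and drop frames whose energy falls below a threshold. The transformer's source is not shown here, so the following is only a minimal sketch under stated assumptions. The function names are hypothetical, `threshold_percentage` is assumed to be a fraction of the maximum frame energy, and `min_silence_duration` is ignored for brevity.

```python
import numpy as np

def frame_energy(audio, frame_length, hop_length):
    """Sum of squared samples for each frame (hypothetical helper)."""
    n_frames = 1 + max(0, (len(audio) - frame_length) // hop_length)
    return np.array([
        np.sum(audio[i * hop_length:i * hop_length + frame_length] ** 2)
        for i in range(n_frames)
    ])

def remove_silence(audio, frame_length, hop_length, threshold_percentage):
    """Keep only the hops whose frame energy reaches the threshold.

    Assumption: the threshold is a fraction of the maximum frame energy.
    """
    energy = frame_energy(audio, frame_length, hop_length)
    threshold = threshold_percentage * energy.max()
    kept = [
        audio[i * hop_length:(i + 1) * hop_length]
        for i in np.flatnonzero(energy >= threshold)
    ]
    return np.concatenate(kept) if kept else audio[:0]

# Toy signal: silence, a constant "voiced" stretch, then silence again.
toy = np.concatenate([np.zeros(100), np.ones(100), np.zeros(100)])
trimmed = remove_silence(toy, frame_length=10, hop_length=10, threshold_percentage=0.1)
print(trimmed.shape)
```

A fuller version would merge only silent runs longer than `min_silence_duration`, so that short pauses inside words are preserved.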

---

Mel-frequency cepstral coefficients (MFCCs) are among the most commonly used spectral feature representations. They are widely employed in automatic speech recognition (ASR) and accent recognition systems and are generally reported to work best with shallow models. Spectrograms, by contrast, tend to be more effective with deep models and are sometimes used for accent recognition. We extract MFCCs with the Librosa library.

MFCC features are extracted from the silence-trimmed audio rather than from pre-segmented audio, because the transformer itself splits each recording into segments.
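To illustrate the idea of segmenting inside the feature extractor, here is a minimal sketch. It is not the code of `MfccTransformer`, whose source is not shown: the function names are hypothetical, `duration` and `overlap` are assumed to be in seconds, and averaging the MFCCs over the frames of each segment is an assumption made for illustration.

```python
import numpy as np

def split_into_segments(audio, sr, duration, overlap):
    """Cut `audio` into fixed-length windows (duration/overlap assumed in seconds)."""
    seg_len = int(duration * sr)
    step = seg_len - int(overlap * sr)
    if seg_len <= 0 or step <= 0:
        raise ValueError("duration must be positive and larger than overlap")
    # Trailing samples that do not fill a whole window are dropped.
    starts = range(0, len(audio) - seg_len + 1, step)
    return [audio[s:s + seg_len] for s in starts]

def segment_mfcc_means(audio, sr, n_mfcc, duration, overlap):
    """Return one mean MFCC vector (length n_mfcc) per segment."""
    import librosa  # imported lazily so the segmentation helper works without it
    return [
        librosa.feature.mfcc(y=seg, sr=sr, n_mfcc=n_mfcc).mean(axis=1)
        for seg in split_into_segments(audio, sr, duration, overlap)
    ]

# Toy check of the segmentation: 10 samples at 1 Hz, 4 s windows with 2 s overlap.
segments = split_into_segments(np.arange(10), sr=1, duration=4, overlap=2)
print(len(segments))  # 4 windows, starting at samples 0, 2, 4 and 6
```

With this layout, each recording yields one value per segment for every coefficient, which is consistent with the list-valued `mfcc_1` ... `mfcc_13` columns produced in the next cell.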

In [7]:
print(df.shape)
df.head()

(242, 2)

                                               audio   labels
0  [-0.00081889424, -0.0012332641, -0.0010821958,...  english
1  [-3.3004353e-06, 2.3220142e-05, -5.8616065e-06...  english
2  [1.5094573e-05, -4.2987816e-07, 2.1244243e-06,...  english
3  [0.0027380765, 0.0043952055, 0.004028226, 0.00...  english
4  [0.00018491458, -9.82639e-05, 6.523135e-05, -7...  english

In [8]:
mfcc_transformer=MfccTransformer(
    variables=["audio", "labels"],
    sampling_rating=SAMPLING_RATING, 
    n_mfcc=N_MFCC,
    duration=SEGMENT_DURATION,
    overlap=SEGMENT_OVERLAP
)
df=mfcc_transformer.fit_transform(df)
print('df shape : ',df.shape)
print('df.columns : ',df.columns)

df shape :  (242, 15)
df.columns :  Index(['audio', 'labels', 'mfcc_1', 'mfcc_2', 'mfcc_3', 'mfcc_4', 'mfcc_5',
       'mfcc_6', 'mfcc_7', 'mfcc_8', 'mfcc_9', 'mfcc_10', 'mfcc_11', 'mfcc_12',
       'mfcc_13'],
      dtype='object')


---

As highlighted in the [Exploratory Data Analysis (EDA) notebook](EDA.ipynb), accent recognition can be improved by focusing on specific intervals rather than analyzing the entire audio signal. To achieve this, we use a transformer that expands the MFCC features: each `mfcc_*` column holds a sequence of values per recording, and every element of that sequence is moved to its own row together with the corresponding label (accent).
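As a point of comparison, pandas can perform this kind of expansion directly with `DataFrame.explode`, which accepts a list of columns in pandas 1.3 and later. The sketch below uses a small made-up frame with hypothetical values, not the notebook's data, and assumes the list-valued columns have equal lengths within each row.

```python
import pandas as pd

# Toy frame: two recordings with 3 and 2 segments (values are made up).
toy = pd.DataFrame({
    "labels": ["english", "arabic"],
    "mfcc_1": [[0.1, 0.2, 0.3], [0.4, 0.5]],
    "mfcc_2": [[1.1, 1.2, 1.3], [1.4, 1.5]],
})

# Explode both list-valued columns together; scalar columns are repeated.
expanded = toy.explode(["mfcc_1", "mfcc_2"], ignore_index=True)
print(expanded)
```

Each recording contributes one row per segment, and the label is repeated on every row that originates from the same recording.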

In [9]:
#expander_transformer= ExpanderTransformer(n_mfcc = N_MFCC)
#df_expanded = expander_transformer.fit_transform(df)

In [10]:
import numpy as np
import pandas as pd
from sklearn.base import TransformerMixin, BaseEstimator

class TestExpanderTransformer(TransformerMixin, BaseEstimator):
    """Expand list-valued MFCC columns so that each element gets its own row."""

    def __init__(self, n_mfcc):
        # Store the parameter unchanged so get_params() and clone() keep working.
        self.n_mfcc = n_mfcc
        self.columns_to_expand = [f"mfcc_{i+1}" for i in range(n_mfcc)]

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        columns_to_repeat = [col for col in X.columns if col not in self.columns_to_expand]
        rows = []
        for _, row in X.iterrows():
            # Values copied unchanged into every expanded row (e.g. audio, labels).
            repeated = {col: row[col] for col in columns_to_repeat}
            expanding_length = len(row[self.columns_to_expand[0]])
            for i in range(expanding_length):
                new_row = {col: row[col][i] for col in self.columns_to_expand}
                new_row.update(repeated)
                rows.append(new_row)
        # Build the frame once from a list of dicts: array-valued cells stay intact,
        # whereas pd.DataFrame(dict) would broadcast each row over the audio length,
        # and repeated pd.concat calls onto an empty frame are slow and raise a warning.
        return pd.DataFrame(rows, columns=X.columns)

In [11]:
expander_transformer = TestExpanderTransformer(n_mfcc = N_MFCC)

In [12]:
df_test=df.head(6)

In [13]:
df_expanded = expander_transformer.fit_transform(df_test)

In [None]:
#print(df_expanded.shape)
#df_expanded['mfcc_1']

---