### Import Modules

In [4]:
import pandas as pd
import numpy as np
import librosa
from scipy.stats import skew
from random import randint
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

Place the dataset you want to predict on in the "Input" folder, then specify its file name in the field "name_input" of the cell below. The folder containing the audio files must also be placed inside "Input".

The output file with the predictions will be saved in the "Predictions" folder with the name specified below in the field "name_output".

In [9]:
path_input = "..\\Input\\"
name_input = "evaluation.csv"
input_file = path_input + name_input

path_output = "..\\Predictions\\"
name_output = "submission"
output_file = path_output + name_output

### Import Data

The cell below imports the dataset to predict, extracts the file paths and sampling rates, and strips the square brackets from the values of the column "tempo" so that they can be converted to floats.

In [10]:
dataset = pd.read_csv(input_file, index_col = 0) 
paths = dataset["path"]
dataset["tempo"] = dataset["tempo"].map(lambda x : x.strip("[]")).astype(float) 
sampling_rates = dataset["sampling_rate"]
dataset.drop(columns = ["sampling_rate", "path"], inplace = True)
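
As a small illustration of the "tempo" clean-up, the sketch below applies the same `strip` and `astype` calls to two made-up rows. It assumes that tempo values are stored as bracketed strings, which is what the stripping of "[]" suggests; the values themselves are hypothetical.

```python
import pandas as pd

# Hypothetical rows: tempo stored as a bracketed string (assumed input format).
raw = pd.DataFrame({"tempo": ["[120.5]", "[98.0]"]})

# Remove the surrounding brackets, then convert the remaining text to float.
raw["tempo"] = raw["tempo"].map(lambda x: x.strip("[]")).astype(float)

print(raw["tempo"].tolist())
```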

### Gender One-Hot Encoding
Categorical features need to be encoded before any computation can be performed on them. The cell below therefore applies one-hot encoding to the feature "gender"; it is preferred to label encoding because it does not impose an arbitrary order on the categories.

In [225]:
dataset = pd.get_dummies(dataset, columns = ["gender"])
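
To make the effect of the encoding concrete, here is a minimal sketch on a toy frame. The rows and the "jitter" values are invented for illustration; only the `pd.get_dummies` call mirrors the cell above.

```python
import pandas as pd

# Toy data with hypothetical values.
toy = pd.DataFrame({
    "gender": ["male", "female", "female"],
    "jitter": [0.02, 0.01, 0.03],
})

# Each category of "gender" becomes its own indicator column.
encoded = pd.get_dummies(toy, columns=["gender"])

print(list(encoded.columns))
print(encoded)
```

Note that the resulting columns depend on the categories present in the data, so the training and evaluation sets only end up with the same columns if they contain the same gender values.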

### Skewness Reduction
Many of the features exhibit skewed distributions, which could potentially affect the performance of the model we aim to build. To address this, we can reduce the skewness by applying a logarithmic transformation. However, before doing so, it is essential to ensure that no feature contains negative or zero values. If necessary, we translate the data to ensure all values are positive, and then apply the logarithmic function element-wise to each entry.

We conducted several trials to assess the impact of the transformation on each distribution, and in cases where the transformation led to undesirable alterations, we chose not to apply it.

In [1]:
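def reduce_skew(data):
    # Shift the data so that every value is strictly positive before taking the log.
    if np.any(data <= 0):
        data = data + np.abs(np.min(data)) + 1e-10

    # Element-wise natural logarithm.
    data = np.log(data)

    return data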

In [227]:
dataset["jitter"] = reduce_skew(dataset["jitter"])
dataset["shimmer"] = reduce_skew(dataset["shimmer"])
dataset["energy"] = reduce_skew(dataset["energy"])
dataset["num_pauses"] = reduce_skew(dataset["num_pauses"])
dataset["zcr_mean"] = reduce_skew(dataset["zcr_mean"])
dataset["tempo"] = reduce_skew(dataset["tempo"])
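
The effect of the logarithm on a skewed distribution can be checked numerically. The sketch below is only an illustration on synthetic log-normal data (not on the dataset's features): it computes the biased sample skewness, the same definition used by `scipy.stats.skew` with its default arguments, before and after the transformation.

```python
import numpy as np

def sample_skewness(x):
    """Biased sample skewness: third central moment over the 1.5 power of the variance."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    return np.mean(centered ** 3) / np.mean(centered ** 2) ** 1.5

# Synthetic, strictly positive, right-skewed data (assumption for the demo).
rng = np.random.default_rng(0)
values = rng.lognormal(mean=0.0, sigma=1.0, size=10_000)

skew_before = sample_skewness(values)
skew_after = sample_skewness(np.log(values))

print(f"skewness before: {skew_before:.2f}, after: {skew_after:.2f}")
```

A skewness close to 0 after the transformation indicates a roughly symmetric distribution; this kind of comparison is one possible way to decide, feature by feature, whether the transformation is worth applying.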

### Audio Features

Several different audio features can be extracted from a vocal message. The library *librosa* offers many methods and techniques to extract this information from a *wav* file.

After a careful evaluation, we decided to extract these new features:

1. **Duration** measures the length of the audio signal. Its extraction is meant to support all the other features that depend on it.

2. **Spectral Centroid Variance**  measures the fluctuation in the "brightness" of a sound over time. It calculates how much the spectral centroid, which represents the perceived sharpness or brightness of an audio signal, varies across different frames. This feature is useful for capturing the dynamic texture of sound.

3. **Fundamental Frequency (F0)** refers to the lowest frequency of a periodic waveform, often corresponding to the pitch of the sound. It is a key feature in speech, as it helps characterize the tone and pitch variations of a voice. The extracted features related to f0 are: mean, standard deviation, median, 95th percentile, 5th percentile.

4. **Onset per Second** refers to the rate at which speech events (onsets) occur over time. It measures how many significant sound changes, such as a syllable or a beat, occur in one second. In speech, it can indicate the pace or rhythm of talking. It is similar to tempo, but not exactly the same thing. Moreover, onsets per second could be a good alternative to the number of words, since onsets can be seen as indicators of the moments when the speaker pronounces a new term.

5. **Mel Frequency Cepstral Coefficients (MFCCs)** are a representation of audio signals that captures the perceptually relevant features of sound. They are widely used in speech and audio processing tasks like speech recognition, speaker identification, and emotion analysis.

MFCCs are derived by:

- Dividing the audio signal into short frames.

- Applying the Fourier Transform to convert the signal to the frequency domain.

- Passing the spectrum through a filter bank that mimics the human ear's perception of pitch (mel scale).

- Taking the logarithm of the filter bank energies to reflect human loudness perception.

- Applying the Discrete Cosine Transform (DCT) to decorrelate the features and retain the most important coefficients.

The model uses the first 13 MFCCs, a number that is usually suggested in speech detection tasks. Each coefficient is then aggregated by averaging it over time.

6. **Mel spectrogram**  is a visual representation of an audio signal's frequency content over time, mapped to the mel scale to reflect how humans perceive pitch.

    To create a mel spectrogram:

- The audio signal is divided into short frames.

- A Fourier Transform is applied to compute the frequency spectrum for each frame.

- The power spectrum is passed through a filter bank aligned with the mel scale, which models the human ear's sensitivity to different frequencies.

- The result is a time-frequency representation where the frequencies are spaced according to the mel scale.

The decision to divide the spectrogram into 64 bands and then keep only the first 20 is intended to focus the analysis on the frequency range where speech typically resides (we can observe from the dataset that minimum and maximum pitch usually fall within an interval that ranges from 140 to 4000 Hz). The mel spectrogram coefficients are then aggregated by averaging each band over time.
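
The steps listed above for the mel spectrogram and the MFCCs can be sketched with plain NumPy. This is a simplified illustration under our own assumptions (frame size, hop, mel-scale formula, no normalization, a synthetic tone as input); it is not librosa's implementation, which differs in several details, and all function names below are hypothetical.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filter_bank(sr, n_fft, n_mels):
    """Triangular filters whose edges are equally spaced on the mel scale."""
    fft_freqs = np.linspace(0.0, sr / 2, n_fft // 2 + 1)
    edges_hz = mel_to_hz(np.linspace(0.0, hz_to_mel(sr / 2), n_mels + 2))
    bank = np.zeros((n_mels, fft_freqs.size))
    for m in range(n_mels):
        left, center, right = edges_hz[m], edges_hz[m + 1], edges_hz[m + 2]
        rising = (fft_freqs - left) / (center - left)
        falling = (right - fft_freqs) / (right - center)
        bank[m] = np.clip(np.minimum(rising, falling), 0.0, None)
    return bank

def log_mel_and_mfcc(y, sr, n_fft=512, hop=256, n_mels=64, n_mfcc=13):
    # 1. Split the signal into short overlapping frames and window them.
    n_frames = 1 + (len(y) - n_fft) // hop
    idx = np.arange(n_fft)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = y[idx] * np.hanning(n_fft)
    # 2. Fourier transform of each frame, turned into a power spectrum.
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    # 3. Mel filter bank energies, shape (n_mels, n_frames).
    mel = mel_filter_bank(sr, n_fft, n_mels) @ power.T
    # 4. Logarithm of the energies (the small offset avoids log(0)).
    log_mel = np.log(mel + 1e-10)
    # 5. Unnormalized DCT-II along the mel axis; keep the first n_mfcc coefficients.
    k = np.arange(n_mfcc)[:, None]
    n = np.arange(n_mels)[None, :]
    dct_basis = np.cos(np.pi * k * (2 * n + 1) / (2 * n_mels))
    mfcc = dct_basis @ log_mel
    return log_mel, mfcc

sr = 16_000
t = np.arange(sr) / sr
y = np.sin(2 * np.pi * 440.0 * t)  # one second of a synthetic 440 Hz tone

log_mel, mfcc = log_mel_and_mfcc(y, sr)
mel_means = log_mel.mean(axis=1)[:20]  # average over time, keep the first 20 bands
mfcc_means = mfcc.mean(axis=1)         # average over time

print(mel_means.shape, mfcc_means.shape)
```

Averaging over the time axis is what turns a variable-length recording into a fixed-length feature vector, which is the form the regression model needs.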

**Durations**

Here audio durations are extracted and the new feature "silence_ratio" is inserted in the dataset.

In [228]:
def get_durations(paths, srs):
    durations = []
    for i, file in enumerate(paths):
        y, sr = librosa.load(path_input + file, sr = srs[i]) #open audio file
        durations.append(librosa.get_duration(y = y, sr = sr)) #extract the duration
    return np.array(durations)

In [229]:
dataset["duration"] = get_durations(paths, sampling_rates)
dataset["silence_ratio"] = dataset["silence_duration"] / dataset["duration"]

**Other Audio Features**

Each of the cells below extracts one of the required audio features (Spectral Centroid Variance, Fundamental Frequency, Onset per Second, MFCCs, Mel Coefficients) for the dataset.

In [230]:
def spect_variance(paths, srs):
    spect_variances = []
    for i, file in enumerate(paths):
        y, sr = librosa.load(path_input + file, sr = srs[i])

        spectral_centroids = librosa.feature.spectral_centroid(y = y, sr = sr)
        spect_variances.append(np.var(spectral_centroids))
    return np.array(spect_variances)

In [231]:
dataset["spectral_centroid_var"] = spect_variance(paths, sampling_rates)
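
To give an intuition of what the spectral centroid variance captures, the following NumPy sketch computes a per-frame centroid (the magnitude-weighted mean frequency) for two synthetic signals. The framing parameters and the signals are assumptions made for the demo, and the helper is a simplified stand-in for `librosa.feature.spectral_centroid`, not a reimplementation of it.

```python
import numpy as np

def spectral_centroid_per_frame(y, sr, n_fft=2048, hop=512):
    """Magnitude-weighted mean frequency of each windowed frame."""
    n_frames = 1 + (len(y) - n_fft) // hop
    idx = np.arange(n_fft)[None, :] + hop * np.arange(n_frames)[:, None]
    magnitude = np.abs(np.fft.rfft(y[idx] * np.hanning(n_fft), axis=1))
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)
    return (magnitude * freqs).sum(axis=1) / (magnitude.sum(axis=1) + 1e-10)

sr = 16_000
t = np.arange(2 * sr) / sr

# A steady 440 Hz tone: the "brightness" does not change over time.
steady = np.sin(2 * np.pi * 440.0 * t)
# A sweep whose instantaneous frequency goes from 200 Hz to 2000 Hz.
sweep = np.sin(2 * np.pi * (200.0 * t + 450.0 * t ** 2))

var_steady = np.var(spectral_centroid_per_frame(steady, sr))
var_sweep = np.var(spectral_centroid_per_frame(sweep, sr))

print(f"steady tone: {var_steady:.2f}, sweep: {var_sweep:.2f}")
```

The sweep yields a much larger variance than the steady tone, which is the kind of dynamic behaviour this feature is meant to summarize.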

In [232]:
def extract_fundamental_frequency(paths, srs, min_pitch, max_pitch):
    f0_means = []
    f0_stds = []
    f0_medians = []
    f0_05_perc = []
    f0_95_perc = []

    for i, file in enumerate(paths):
        y, sr = librosa.load(path_input + file, sr = srs[i])

        f0 = librosa.yin(y, fmin=min_pitch[i], fmax=max_pitch[i], sr=sr)
        f0_means.append(np.nanmean(f0))
        f0_stds.append(np.nanstd(f0))
        f0_medians.append(np.nanmedian(f0))
        f0_05_perc.append(np.nanpercentile(f0, 5))
        f0_95_perc.append(np.nanpercentile(f0, 95))
    return np.array(f0_means), np.array(f0_stds), np.array(f0_medians), np.array(f0_05_perc), np.array(f0_95_perc)

In [233]:
f0_means, f0_stds, f0_medians, f0_05_perc, f0_95_perc = extract_fundamental_frequency(paths, sampling_rates, dataset["min_pitch"], dataset["max_pitch"])
dataset["f0_mean"] = f0_means
dataset["f0_std"] = f0_stds
dataset["f0_median"] = f0_medians
dataset["f0_05_perc"] = f0_05_perc
dataset["f0_95_perc"] = f0_95_perc

In [234]:
def extract_onsets_per_seconds(paths, srs, durations):
    onsets_per_second = []
    for i, file in enumerate(paths):
        y, sr = librosa.load(path_input + file, sr = srs[i])

        onset_frames = librosa.onset.onset_detect(y=y, sr = sr, hop_length=512, backtrack=True)
        onsets_per_second.append(len(onset_frames) / durations[i])
    return np.array(onsets_per_second)

In [235]:
dataset["onsets_per_second"] = extract_onsets_per_seconds(paths, sampling_rates, dataset["duration"])

In [236]:
def extract_mfcc(paths, n_mfccs, srs):
    mfcc_means = []
    for i, file in enumerate(paths):
        y, sr = librosa.load(path_input + file, sr = srs[i])
        mfccs = librosa.feature.mfcc(y = y, sr = sr, n_mfcc = n_mfccs)
        mfcc_mean = np.mean(mfccs, axis = 1)  # Mean MFCCs
        mfcc_means.append(mfcc_mean)
    return pd.DataFrame(mfcc_means, columns = [f"Mfcc_{i+1}_mean" for i in range(n_mfccs)])

In [237]:
mfccs_means = extract_mfcc(paths, n_mfccs = 13, srs = sampling_rates)
dataset = pd.concat((dataset, mfccs_means), axis = 1)

The function below is not meant to extract a feature but to normalize the length of the recordings by randomly extracting *target_duration* seconds from each audio (or zero-padding it when it is shorter). It will be used to extract the mel coefficients.

In [238]:
def normalize_audio(y, sr, target_duration = 7):
        """Normalize the length of the audio signal to target_duration seconds."""
        target_length = int(target_duration * sr)
        if len(y) > target_length: 
            start_frame = randint(0, len(y) - target_length)
            end_frame = start_frame + target_length
            return y[start_frame : end_frame]
        elif len(y) < target_length: 
            padding = target_length - len(y)
            return np.pad(y, (0, padding), mode='constant')
        return y
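
Since the crop position is drawn with an unseeded `randint`, the mel features of long recordings can change from one run to the next. The sketch below is a possible variant, not the function used in this notebook: it takes an optional NumPy random generator (our assumption) so that the crop can be reproduced, and is exercised on synthetic arrays.

```python
import numpy as np

def crop_or_pad(y, sr, target_duration=7, rng=None):
    """Return exactly target_duration seconds: random crop if longer, zero-pad if shorter."""
    rng = np.random.default_rng() if rng is None else rng
    target_length = int(target_duration * sr)
    if len(y) > target_length:
        # The upper bound is exclusive, so +1 keeps the last valid start position.
        start = int(rng.integers(0, len(y) - target_length + 1))
        return y[start:start + target_length]
    return np.pad(y, (0, target_length - len(y)), mode="constant")

sr = 1_000                                   # hypothetical sampling rate
long_clip = np.arange(10 * sr, dtype=float)  # 10 "seconds" of synthetic samples
short_clip = np.ones(3 * sr)                 # 3 "seconds" of synthetic samples

a = crop_or_pad(long_clip, sr, rng=np.random.default_rng(42))
b = crop_or_pad(long_clip, sr, rng=np.random.default_rng(42))
c = crop_or_pad(short_clip, sr)

print(len(a), len(c), np.array_equal(a, b))
```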

In [239]:
def extract_mel(paths, n_mels, srs):
    mels_means = []

    for i, file in enumerate(paths):
        y, sr = librosa.load(path_input + file, sr = srs[i])
        y_normalized = normalize_audio(y, sr, target_duration = 7)

        mel_spectrogram = librosa.feature.melspectrogram(y=y_normalized, sr=sr, n_mels=n_mels)
        log_spectrogram = librosa.power_to_db(mel_spectrogram, ref=np.max)
        mels_means.append(np.mean(log_spectrogram, axis=1))

    return pd.DataFrame(mels_means, columns = [f"Mel_{i+1}_mean" for i in range(n_mels)])

In [240]:
mels_mean = extract_mel(paths, n_mels = 64, srs = sampling_rates)
dataset = pd.concat((dataset, mels_mean.iloc[:, :20]), axis = 1)

### Drop Columns
The cell below drops all the columns corresponding to the features we decided to discard. Here is a brief list of the reasons that led us to discard them:

1. *Ethnicity*: Ages are not uniformly distributed across the ethnicities, so relying on this attribute would unfairly exploit the structure of the training set. To further support this, we also computed the cosine similarity between the centroids of pairs of ethnicities that should have nothing in common, and it turned out to be very close to 1 in nearly all cases.

2. *Min Pitch and Max Pitch*: The distributions for males and females, and for different ages, are so similar that they can be considered identical: the differences are too small to be perceived. Hence, it makes no sense to keep these attributes in the dataset, given that they do not help distinguish speakers.

3. *Num Words and Num Characters*: A manual inspection of the recordings shows that, in most of them, the speaker asks a third person to tell Stella to bring a list of things from a store, so these counts say little about the speaker. A few other short messages, such as "I'm from Nigeria" or "I'm hungry and thirsty", are scattered in the dataset. Moreover, the inspection revealed that many recordings have num_words and num_characters equal to 0 even though someone actually speaks. We cannot be certain but, given that most of the 0-character recordings are not in English, we suppose that something went wrong with the transcription.

4. *Silence Duration*: it has been replaced by "silence_ratio", that is, the silence duration normalized by the audio duration.
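
The centroid comparison mentioned for *Ethnicity* can be sketched as follows. The data, the group labels "A" and "B" and the feature names are entirely synthetic, and the exact features and scaling used in our analysis are not reproduced here; the snippet only shows how a cosine similarity between two group centroids is computed.

```python
import numpy as np
import pandas as pd

# Synthetic features: both groups are drawn from the same distribution.
rng = np.random.default_rng(0)
toy = pd.DataFrame(
    rng.normal(loc=5.0, scale=1.0, size=(200, 4)),
    columns=["f1", "f2", "f3", "f4"],
)
toy["ethnicity"] = rng.choice(["A", "B"], size=200)

# Centroid of each group = mean feature vector of its rows.
centroids = toy.groupby("ethnicity").mean()
a = centroids.loc["A"].to_numpy()
b = centroids.loc["B"].to_numpy()

# Cosine similarity: dot product divided by the product of the norms.
cosine = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(f"cosine similarity between centroids: {cosine:.4f}")
```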

In [241]:
dataset.drop(columns = ["ethnicity", "min_pitch", "max_pitch", "num_words", "num_characters", "silence_duration"], inplace = True)

In [242]:
#Uncomment these lines if you are using the development dataset.

#dataset.drop(columns = ["age"], inplace = True)
#dataset.to_csv("..\\Datasets\\Preprocessed Data\\Training", index = False)

### Model
We chose the Ridge regression model for our task of predicting ages from vocal features due to its ability to handle datasets with a large number of features effectively. Ridge regression adds an 𝐿2 regularization penalty to the loss function, which helps prevent overfitting by shrinking the coefficients of less important features. This was particularly important in our case, as we were working with a dataset containing many features, some of which might be correlated or not strongly predictive of age. By incorporating regularization, Ridge regression ensures a more robust and generalizable model that balances predictive accuracy with feature stability.

Additionally, we placed Polynomial Features of degree 2 at the beginning of the pipeline to capture potential nonlinear relationships between the features and the target variable. Combined with regularization, this lets the model use the additional squared and interaction terms while keeping the risk of overfitting under control.
Finally, to ensure all features were on a comparable scale, we applied a Standard Scaler after generating the polynomial features.
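
The shrinking effect of the L2 penalty can be seen on a toy problem. Ridge regression minimizes ||y - Xw||² + alpha·||w||², which on centered data has the closed-form solution w = (XᵀX + alpha·I)⁻¹Xᵀy. The sketch below uses this formula directly with NumPy on synthetic data; it is only an illustration of the principle, not the scikit-learn pipeline used in this notebook, and `ridge_fit` is a hypothetical helper.

```python
import numpy as np

def ridge_fit(X, y, alpha):
    """Closed-form ridge coefficients on centered data (intercept left unpenalized)."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    n_features = X.shape[1]
    return np.linalg.solve(Xc.T @ Xc + alpha * np.eye(n_features), Xc.T @ yc)

# Synthetic data: only the first feature matters, and two features are nearly collinear.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
X[:, 4] = X[:, 3] + 0.01 * rng.normal(size=100)
y = X[:, 0] + rng.normal(scale=0.5, size=100)

w_ols = ridge_fit(X, y, alpha=0.0)      # no regularization
w_ridge = ridge_fit(X, y, alpha=178.0)  # same alpha as in the pipeline below

print("norm without penalty:", np.linalg.norm(w_ols))
print("norm with penalty:   ", np.linalg.norm(w_ridge))
```

The coefficient vector obtained with the penalty has a smaller norm, and the weights of the two nearly collinear features are kept small instead of taking large values of opposite sign.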

The cells below define the model and fit it on the training set stored in the appropriate folder.

In [None]:
df_train = pd.read_csv("..\\Datasets\\Preprocessed Data\\Training")
age = pd.read_csv("..\\Datasets\\Original Data\\development.csv", usecols = ["age"]).astype(float)

In [7]:
pipeline = Pipeline([
    ("poly" , PolynomialFeatures(2)),
    ("scaler" , StandardScaler()),
    ("ridge" , Ridge(alpha = 178))
])

X_train, y_train = df_train, age.values
pipeline.fit(X_train, y_train);

Finally, the cell below is meant to make the predictions on the given input, saving the resulting file in the output folder.

In [None]:
y_pred = pipeline.predict(dataset).flatten() # predict on the preprocessed input, not on the training set

predictions = pd.DataFrame({"Id" : [i for i in range(y_pred.size)], "Predicted" : y_pred})
predictions.to_csv(output_file, index = False)