# The MLEnd Deception Model

# 1 Author

**Student Name**:  Anabia Aijaz <br>
**Student ID**:  



# 2 Problem formulation


The problem involves developing a predictive model to determine whether a 30-second audio recording is true or deceptive. This is challenging because it not only involves analysing complex audio signals but also poses issues such as background noise and inconsistent recording quality, variability in speakers' accents and styles, limited data, the difficulty of extracting reliable acoustic features, and potential biases caused by limited diversity in the dataset. However, the project is exciting because it could lead to further research into lie detection models that might be useful in areas such as security and law enforcement.

# 3 Methodology


The methodology involves mapping audio features (predictor attributes) to the corresponding labels (true or deceptive) using supervised classification techniques. For this purpose the MLEnd deception dataset is used, which consists of 100 audio recordings with three attributes: a complex audio signal, a descriptive label and a binary label. A dataset with the audio file name, language and labels is also provided, but it was not used in this project. For convenience, we first split the audio recordings to match the input requirement of 30 seconds. However, these recordings cannot be fed directly to our model, as audio signals are complex data types: continuous waveforms that represent sound over time. To convert them into digital form, they are sampled at a specific rate, called the sampling frequency, which determines how many data points (samples) are recorded per second. Each sample captures the amplitude of the sound wave at a specific moment.

Using librosa.load(sr=None), we can get the original audio sampling frequency which is 44,100 Hz, meaning there are 44,100 samples per second. For a 30-second recording, this gives us 1,323,000 samples. If we feed this directly into our model, we would be working with a high-dimensional dataset, which can lead to issues like overfitting due to insufficient samples for training. To manage this, we will be extracting key features after which we perform the training, validation and test task. 
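
The arithmetic above can be checked in a few lines. This is a minimal sketch that only restates the figures from the text (44,100 Hz, 30 seconds, four extracted features); the variable names are illustrative.

```python
# Values taken from the text above: 44.1 kHz sampling rate and 30-second clips.
sampling_rate_hz = 44_100
clip_duration_s = 30
n_features = 4  # power, pitch mean, pitch std, voiced fraction

# Number of raw amplitude values in one clip
raw_dimensions = sampling_rate_hz * clip_duration_s

# How much smaller the feature representation is than the raw signal
reduction_factor = raw_dimensions / n_features

print(f"Raw samples per clip: {raw_dimensions:,}")
print(f"Reduction factor after feature extraction: {reduction_factor:,.0f}x")
```

With only a few hundred segments available, a representation with four predictors is far easier to fit than one with more than a million.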

**Training task:** Trained the model to recognise patterns or relationships from the extracted features of our audio data by adjusting its parameters based on labeled examples in the training dataset. 

**Validation Task:** Evaluated the performance of different models and their hyperparameters on data held out from training, to identify the best-performing model, avoid overfitting and support model optimisation. 

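K-fold cross-validation can further reduce reliance on a single validation set: the dataset is divided into K folds (for example 10), the model is trained on K-1 folds and evaluated on the remaining one, and the process is repeated so that every fold serves once as the validation fold. This makes efficient use of limited data and gives a more stable performance estimate. As explained in Section 5, it was not performed in this project because the folds would need to be grouped by file ID.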

**Test Task:** Estimated the deployment quality of a model on unseen data. It is performed only once after building our model to see how the model behaves in real world scenario on unknown data.


The model can be evaluated with the following quality metrics. Since the dataset encodes true as 0 (negative) and deceptive as 1 (positive), the metrics are interpreted on that basis. Later in the code, however, the metrics are displayed for both classes separately.

**Accuracy:** Proportion of correctly classified samples both true and deceptive. <br>
**Error Rate:** Proportion of misclassified samples, both true and deceptive. It is the complement of accuracy, i.e. the model's failure rate, so we will not calculate it separately for our models. <br>
**Confusion Matrix (Count/Rate):** A table that summarises the number/rate of true positives (correctly predicted deceptive (1)), true negatives (correctly predicted true (0)), false positives (incorrectly predicted deceptive (1)), and false negatives (incorrectly predicted true (0)). <br>
**Recall (Sensitivity):** Proportion of actual deceptive samples correctly classified as deceptive. It helps to minimize the risk of missing a deceptive recording (false negatives) <br>
**Specificity:** Proportion of actual true samples correctly classified. It helps assess how well the model avoids incorrectly identifying a true recording as deceptive. <br>
**Precision:** Proportion of predicted deceptive samples that are actually deceptive. It helps assess that when the model predicts a recording as deceptive, it is indeed deceptive and not a false alarm <br>
**F1 Score:** Harmonic mean of precision and recall, balancing false positives(deceptive) and false negatives(true). <br>
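
The definitions above can be written out directly from the four confusion-matrix counts. The following is a minimal sketch; the function name `classification_metrics` and the counts passed to it are hypothetical and chosen only for illustration, not results from this project.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute the metrics listed above from confusion-matrix counts.

    Deceptive (1) is the positive class and true (0) the negative class.
    """
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    recall = tp / (tp + fn) if (tp + fn) else 0.0        # sensitivity
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {
        "accuracy": accuracy,
        "error_rate": 1 - accuracy,
        "recall": recall,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
    }


# Hypothetical counts for illustration only
metrics = classification_metrics(tp=30, tn=25, fp=15, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

In practice the same values are available from `sklearn.metrics`, which the notebook imports later.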

**Example Interpretation**
For example, in security and law enforcement applications, a high recall means real threats or crimes are identified, whereas a high precision ensures that the system flags only real threats or crimes and avoids wasting resources. However, high precision might come at the cost of missing some positive cases (low recall), meaning that some real threats or crimes go undetected. The F1 score balances the two: a high F1 score indicates that the system can identify threats or criminal activities without being overwhelmed by false alarms.


For our analysis we used Accuracy, Confusion Matrix (Count/Rate), Recall, Precision and F1 Score as the quality metrics.

# 4 Implemented ML prediction pipelines

The ML prediction pipeline involves the following stages:<br>
**Input:** Audio recording of varying duration from the MLEnd deception dataset <br>
**Transformation stage:** <br>
    **-Segmentation:** Split the audio recordings into 30-second segments. The input is the raw audio recording and the output is multiple 30-second segments carrying the same label as the raw audio file. A pandas DataFrame stores the metadata, namely the path ['X_paths'], labels ['Y', 'Y_encoded'], file name ['FileID'] and segment name ['FileSegment']. <br>  
    **-Feature Extraction:** Extract features from each segment. The input is the set of 30-second audio segments, from which four features are extracted: power, pitch mean, pitch standard deviation and fraction of voiced frames. The output is stored in a DataFrame alongside the metadata from the segmentation. <br>
    **-Normalization/Scaling:** Normalize the extracted features using the StandardScaler function to ensure that the features have a mean of 0 and a standard deviation of 1. This makes the features comparable and prevents features with larger numerical ranges from dominating the model's learning process.<br>

**Model stage:** <br>
    **-Model Training:** Train multiple models (e.g. Logistic regression, Decision Trees, SVM, K-Nearest Neighbors (KNN)) on the transformed training data.<br>
    **-Model Evaluation:** Evaluate the performance of individual models using metrics like accuracy, precision, recall, or F1 score on validation data<br>

**Ensemble stage:** Combine the predictions from multiple models using different ensemble techniques and evaluate their performance <br>  


## 4.1 Transformation stage

Initial analysis of the audio recordings in the MLEnd deception dataset shows that the recordings are longer than 30 seconds.

**4.1.1 Segmentation:** The function **split_audio_files** splits the audio files into multiple 30-second segments with a 50% overlap, i.e. 15 seconds. The overlap ensures that patterns at segment borders are not missed and increases the number of samples in the training and test sets. The labels are propagated from the original audio file to each of its segments. For example, for the audio file 00001.wav with a duration of 122.17 seconds the segments are as follows:   
segment 1: 00001_01.wav 0 - 30sec &emsp;segment 2: 00001_02.wav 15 - 45sec &emsp;segment 3: 00001_03.wav 30 - 60sec  
segment 4: 00001_04.wav 45 - 75sec &emsp;segment 5: 00001_05.wav 60 - 90sec &emsp;segment 6: 00001_06.wav 75 - 105sec   
segment 7: 00001_07.wav 90 - 120sec &emsp;segment 8: 105 - 122.17sec (which is less than 30sec so discarded)

Total 7 segments are created from the audio file 00001.wav. 
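
The window arithmetic behind this example can be sketched without loading any audio. The helper `segment_bounds` below is hypothetical and works in seconds, whereas split_audio_files works in samples; the sliding-window logic is otherwise the same.

```python
def segment_bounds(duration_s, window_s=30.0, overlap_s=15.0):
    """Return (start, end) times in seconds of every full window that fits."""
    step_s = window_s - overlap_s
    if step_s <= 0:
        raise ValueError("overlap must be smaller than the window duration")
    bounds = []
    start = 0.0
    # Keep a window only if it fits completely inside the recording
    while start + window_s <= duration_s:
        bounds.append((start, start + window_s))
        start += step_s
    return bounds


# Duration of 00001.wav as quoted in the text
bounds = segment_bounds(122.17)
for k, (start, end) in enumerate(bounds, start=1):
    print(f"00001_{k:02d}.wav: {start:g} - {end:g} sec")
print(f"Total segments: {len(bounds)}")
```

The guard on `step_s` is an addition for this sketch: an overlap equal to the window length would otherwise never advance the window.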

Similarly, all the audio files in the training and test sets are split into 30-second segments and stored in the specified path. Information on these segments is stored separately in a pandas DataFrame with the columns path ['X_paths'], labels ['Y', 'Y_encoded'], file name ['FileID'] and segment name ['FileSegment'], which is returned from the function.

In [None]:
# Function to split audio files into 30-second segments with 15-second overlap
def split_audio_files(raw_data, destination_folder, window_duration=30, overlap=15):
    """
    This function splits audio files into smaller segments of fixed duration with a specified overlap.
    The segments are saved as separate audio files in the provided destination folder. Metadata for each
    segment (such as file path, label, and segment ID) is stored in a DataFrame and returned.

    Parameters:
    raw_data (pandas DataFrame): A DataFrame containing metadata for the audio files. 
    destination_folder (str): Path to the folder where the audio segments will be saved.
    window_duration (int, optional): Duration (in seconds) of each segment (default is 30 seconds).
    overlap (int, optional): Duration (in seconds) of the overlap between consecutive segments (default is 15 seconds).

    Returns:
    segment_df (pandas DataFrame): A DataFrame containing the metadata of the audio segments with File and Segment names

    """
    
    os.makedirs(destination_folder, exist_ok=True)  # Create folder if it doesn't exist

    # Create a list to store segment details
    segment_data = []

    # Process each audio file in the DataFrame
    for i, row in raw_data.iterrows():
        file_path = row['X_paths']
        label = row['Y']
        label_encoded = row['Y_encoded']
        fileID = f"{i+1:05d}"  # Unique ID for each file

        # Debug prints to verify the file processing
        print(f"\nProcessing file: {file_path}")
        print(f"FileID: {fileID}, Label: {label}, Encoded Label: {label_encoded}")

        # Load the audio
        x, fs = librosa.load(file_path, sr=None)  # Use original sampling rate
        print(f"Audio duration: {len(x) / fs:.2f} seconds")

        # Calculate window and step size in samples
        window_length = int(window_duration * fs)
        step_size = int((window_duration - overlap) * fs)

        # Sliding window logic
        start_index = 0
        segment_count = 0

        while start_index + window_length <= len(x):  # Only process if full window fits
            end_index = start_index + window_length
            segment = x[start_index:end_index]  # Extract the segment
            start_index += step_size
            segment_count += 1

            # Create the segment file name
            segment_name = f"{fileID}_{segment_count:02d}.wav"
            segment_path = os.path.join(destination_folder, segment_name)
            sf.write(segment_path, segment, fs)  # Save the segment to the destination folder

            # Append segment details to the list
            segment_data.append({
                'X_paths': segment_path,
                'Y': label,
                'Y_encoded': label_encoded,
                'FileID': fileID,
                'FileSegment': f"{fileID}_{segment_count:02d}"
            })
            #print(f"Saved segment: {segment_path}")
        
        # After processing all segments for the current file, print the total number of segments
        print(f"Total number of segments: {segment_count}")


    # Create a DataFrame from the segment details
    segment_df = pd.DataFrame(segment_data)
    return segment_df


**4.1.2 Feature Extraction:**

Complex input data types such as audio signals have hundreds of thousands of dimensions or more. If we feed our model the raw input signal, where the number of samples is far smaller than the number of features, we will encounter the curse of dimensionality. Thus, we extract some useful features and use them as predictors to train our model. The features extracted are as follows:

Power: The average energy level of the audio signal, indicating the loudness of the speech. It is calculated using the formula

$$\text{Power} = \frac{1}{N} \sum_{i=1}^{N} x[i]^2$$

Where:

x[i] = Amplitude of the i-th audio sample.
N = Total number of samples in the audio signal


Pitch - Mean: The average frequency of the speaker's voice, reflecting the overall tone or pitch level. It is calculated using the librosa.pyin function to get the pitch of each frame and then taking its mean.

Pitch - Standard Deviation: The variability in pitch over the audio, capturing fluctuations in tone or intonation. It is also calculated using the librosa.pyin function to get the pitch of each frame and then taking its standard deviation.

Fraction of Voiced Region: The proportion of the audio duration containing voiced (spoken) sounds, as opposed to silence or unvoiced noise. It is calculated from the binary voiced_flag array returned by the librosa.pyin function, as the fraction of frames flagged as voiced.
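
To make the four definitions concrete, here is a minimal sketch that computes them on toy inputs using only the standard library. The helper names, the synthetic sine wave and the hand-written `f0` and `voiced_flag` lists are all assumptions for illustration; in the notebook these arrays come from librosa.

```python
import math
import statistics


def signal_power(x):
    """Mean of the squared amplitudes, as in the power formula above."""
    return sum(v * v for v in x) / len(x)


def pitch_summary(f0, voiced_flag):
    """Mean and standard deviation of pitch over voiced frames, plus voiced fraction."""
    # Unvoiced frames carry NaN pitch, so they are dropped before averaging
    voiced_f0 = [f for f in f0 if not math.isnan(f)]
    if voiced_f0:
        pitch_mean = statistics.fmean(voiced_f0)
        pitch_std = statistics.pstdev(voiced_f0)  # population std
    else:
        pitch_mean, pitch_std = 0.0, 0.0
    voiced_fr = sum(voiced_flag) / len(voiced_flag)
    return pitch_mean, pitch_std, voiced_fr


# Toy signal: one second of a 100 Hz sine wave sampled at 8 kHz
fs = 8000
x = [math.sin(2 * math.pi * 100 * n / fs) for n in range(fs)]
power = signal_power(x)

# Toy per-frame pitch track: NaN marks unvoiced frames
f0 = [float("nan"), 100.0, 120.0, float("nan"), 110.0]
voiced_flag = [False, True, True, False, True]
pitch_mean, pitch_std, voiced_fr = pitch_summary(f0, voiced_flag)

print(f"power={power:.3f}, pitch_mean={pitch_mean:.1f}, "
      f"pitch_std={pitch_std:.2f}, voiced_fr={voiced_fr:.2f}")
```

A unit-amplitude sine has a power of 0.5, which gives a quick sanity check for the formula.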


The functions **extract_features** and **getPitch** from the starter kit are used to extract features from the audio signal. These features are stored in a pandas DataFrame along with their corresponding labels and IDs, and exported as a CSV file for ease of use during experimentation.


In [None]:
def getPitch(x, fs, winLen=0.02):
    """
    This function estimates the pitch and voiced/unvoiced flags from an audio signal
    Parameters:
    x (numpy array): The input audio signal (time-domain samples).
    fs (int): The original sample rate (sampling frequency) of the audio signal in Hz.
    winLen (float): The window length (in seconds) for pitch estimation (default is 0.02 seconds).

    """

    
    # Calculate the frame length in samples based on the given window length (winLen) and sampling rate (fs)
    p = winLen * fs
    
    # Adjust frame_length to the nearest power of 2 greater than or equal to p
    frame_length = int(2**int(p - 1).bit_length())
    
    # Set hop length to half the frame length for overlapping frames
    hop_length = frame_length // 2
    
    # Use librosa's pyin function to estimate the pitch (f0), voiced/unvoiced flag, and voiced probabilities
    # fmin and fmax specify the minimum and maximum pitch range (80 Hz to 450 Hz)
    # sr is the sample rate, frame_length and hop_length control the time resolution of the pitch estimation
    f0, voiced_flag, voiced_probs = librosa.pyin(y=x, fmin=80, fmax=450, sr=fs,
                                                 frame_length=frame_length, hop_length=hop_length)
    
    # Return the estimated pitch (f0) and the voiced/unvoiced flags
    return f0, voiced_flag


def extract_features(data, scale_audio=False):
    """
    This function extracts audio features (such as power, pitch mean, pitch standard deviation, and voiced fraction)
    from a dataset of audio files.
    Parameters:
    data (pandas DataFrame): A DataFrame containing metadata of the audio files
    scale_audio (bool): A flag to scale the audio signal (default is False). If True, the audio signal will be
                        normalized to the range [-1, 1].
    Returns:
    feature_df (pandas DataFrame): A DataFrame containing the extracted features for each audio file along with
                                   its metadata.
    """                        

   
    # Create empty lists to store the extracted features and metadata
    features = []

    # Iterate through each row in the data DataFrame
    for index, row in tqdm(data.iterrows(), total=data.shape[0]):
        file_path = row['X_paths']

        # Load the audio file
        x, fs = librosa.load(file_path, sr=None)  # sr=None keeps the file's original sampling rate

        if scale_audio:
            x = x / np.max(np.abs(x))  # Scale the audio if required

        # Extract pitch and voiced flag
        f0, voiced_flag = getPitch(x, fs, winLen=0.02)

        # Extract features
        power = np.sum(x ** 2) / len(x)
        pitch_mean = np.nanmean(f0) if np.mean(np.isnan(f0)) < 1 else 0
        pitch_std = np.nanstd(f0) if np.mean(np.isnan(f0)) < 1 else 0
        voiced_fr = np.mean(voiced_flag)

        # Add the features and metadata as a row to the list
        feature_row = {
            'X_paths': row['X_paths'],
            'Y': row['Y'],
            'Y_encoded': row['Y_encoded'],
            'FileID': row['FileID'],
            'FileSegment': row['FileSegment'],
            'power': power,
            'pitch_mean': pitch_mean,
            'pitch_std': pitch_std,
            'voiced_fr': voiced_fr
        }

        features.append(feature_row)

    # Create a new DataFrame from the list of features
    feature_df = pd.DataFrame(features)
    return feature_df
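
The frame-size arithmetic in getPitch is easy to misread, so a small worked sketch may help. The helper `pyin_frame_sizes` is hypothetical; it repeats only the integer arithmetic from getPitch and does not call librosa.

```python
def pyin_frame_sizes(fs, win_len=0.02):
    """Mirror the frame-size arithmetic used in getPitch."""
    p = win_len * fs  # desired window length in samples
    # Smallest power of two that is >= p
    frame_length = 2 ** int(p - 1).bit_length()
    # 50% overlap between consecutive frames
    hop_length = frame_length // 2
    return frame_length, hop_length


for fs in (44_100, 16_000):
    frame_length, hop_length = pyin_frame_sizes(fs)
    print(f"fs={fs}: frame_length={frame_length}, hop_length={hop_length}")
```

At 44,100 Hz the requested 20 ms window (882 samples) is rounded up to 1024 samples, about 23 ms, with a hop of 512 samples.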


**Data normalisation:**

The extracted attributes have very different ranges, so features with very large values could be given more weight in the model. Hence, the attributes are scaled so that they all fall in similar ranges using the StandardScaler function.
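
As a rough illustration of what standardisation does, the sketch below scales a single feature by hand. The helper names and the pitch values are hypothetical; the notebook itself relies on scikit-learn's StandardScaler.

```python
import statistics


def fit_standardizer(values):
    """Learn the mean and population standard deviation from training values."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    # Avoid division by zero for a constant feature
    return mean, (std if std > 0 else 1.0)


def standardize(values, mean, std):
    """Apply z-score scaling with previously learned statistics."""
    return [(v - mean) / std for v in values]


# Hypothetical pitch_mean values in Hz, for illustration only
train_pitch = [110.0, 180.0, 145.0, 205.0]
val_pitch = [160.0, 120.0]

mean, std = fit_standardizer(train_pitch)
train_scaled = standardize(train_pitch, mean, std)
# The validation data reuses the training statistics to avoid leakage
val_scaled = standardize(val_pitch, mean, std)

print([round(v, 3) for v in train_scaled])
print([round(v, 3) for v in val_scaled])
```

The statistics are learned from the training values only and then reused for the validation values, which is the same fit/transform separation applied later in the notebook.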

## 4.2 Model stage

Supervised learning techniques: The MLEnd deception dataset contains labelled samples, therefore different supervised learning models are chosen. They are built from a dataset of labelled examples by mapping the predictor attributes to their corresponding label attribute. Non-parametric approaches are used as they offer flexible decision boundaries and the capacity to handle complex data. The following techniques are tried:


**Support Vector Machine (SVM):**

Support Vector Machines are supervised learning models used for classification tasks. They aim to find the optimal hyperplane (decision boundary) that separates data points into different classes with the maximum margin. The support vectors are the data points closest to the hyperplane and help define the margin. The following key hyperparameters are optimised using grid search: <br>
-Soft Margin (C): Regularization parameter that balances margin maximization and classification error.<br>
-Gamma: It defines how far the influence of a single training example reaches.<br>
-Radial Basis Function (RBF) Kernel: It enables effective handling of non-linear relationships by mapping the features into a higher-dimensional space.<br>
SVM is suitable for this task as it performs well in high-dimensional feature spaces, can handle non-linear relationships and is robust to noise through its focus on support vectors.
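
The role of gamma is easiest to see in the RBF kernel itself, K(a, b) = exp(-gamma * ||a - b||^2). The sketch below evaluates it for a few hypothetical four-dimensional points; the function `rbf_kernel` is written here for illustration and is not the library implementation.

```python
import math


def rbf_kernel(a, b, gamma):
    """RBF similarity: exp(-gamma * squared Euclidean distance)."""
    sq_dist = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-gamma * sq_dist)


# Hypothetical standardised feature vectors (power, pitch mean, pitch std, voiced fraction)
x1 = [0.0, 0.0, 0.0, 0.0]
x_near = [0.5, 0.0, 0.0, 0.0]
x_far = [2.0, 0.0, 0.0, 0.0]

for gamma in (0.1, 2.0):
    print(f"gamma={gamma}: near={rbf_kernel(x1, x_near, gamma):.3f}, "
          f"far={rbf_kernel(x1, x_far, gamma):.3f}")
```

A larger gamma makes the similarity fall off faster with distance, so each training example influences a smaller neighbourhood and the decision boundary can become more intricate.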


**K-Nearest Neighbors (KNN):**

KNN is a supervised learning algorithm used for classification, which predicts the class of a data point based on the majority class of its k nearest neighbors in the feature space. It is chosen for its simplicity, adaptability to complex decision boundaries, and reliance on local patterns in the data for accurate classification. Key hyperparameters include:

-Number of Neighbors (k): Determines how many neighbors are considered for classification.<br>
-Distance Metric: Common metrics include Euclidean, Manhattan, or Minkowski distance, which measure proximity between data points.<br>
-Weighting: Neighbors can be weighted uniformly or inversely proportional to their distance from the query point.<br>
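
The prediction rule can be sketched in a few lines. This is a toy implementation with uniform weighting and Euclidean distance; the function `knn_predict` and the two-feature data points are hypothetical and used only to show the mechanism.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, query, k=3):
    """Predict the majority label among the k nearest training points."""
    # Pair each training point's Euclidean distance to the query with its label
    distances = sorted(
        (math.dist(x, query), label) for x, label in zip(train_X, train_y)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical scaled features, two per point for brevity; 0 = true, 1 = deceptive
train_X = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3), (2.0, 2.0), (2.2, 1.9), (1.8, 2.1)]
train_y = [0, 0, 0, 1, 1, 1]

print(knn_predict(train_X, train_y, (0.1, 0.1)))
print(knn_predict(train_X, train_y, (2.0, 2.1)))
```

Because the rule depends on distances, feature scaling matters: an unscaled feature with a large range would dominate the neighbour search.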

**Decision Tree:** A supervised learning algorithm used for classification, which splits the data into subsets based on feature values, creating a tree-like model of decisions. It is chosen for its interpretability, ability to handle non-linear relationships, and suitability for datasets with mixed feature types. Key hyperparameters include:

Max Depth: Limits the depth of the tree to control overfitting. <br>
Min Samples Split: The minimum number of samples required to split a node. <br>
Criterion: Determines the measure of impurity (e.g., Gini index or entropy). <br>
Further parameters will be explored after experimentation with these
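
The two impurity criteria mentioned above can be computed directly from the labels in a node. The following is a minimal sketch with hypothetical helper names and toy label lists.

```python
import math
from collections import Counter


def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def entropy(labels):
    """Shannon entropy in bits of the class distribution."""
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())


mixed_node = [0, 0, 1, 1]    # evenly mixed node: maximum impurity for two classes
skewed_node = [0, 0, 0, 1]   # mostly one class: lower impurity

for name, node in (("mixed", mixed_node), ("skewed", skewed_node)):
    print(f"{name}: gini={gini(node):.3f}, entropy={entropy(node):.3f}")
```

A split is considered good when the child nodes have lower impurity than the parent, under whichever criterion is selected.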



## 4.3 Ensemble stage

Ensemble methods help to create a new model that combines the strengths of diverse base models. The following ensemble techniques are used:

**Random Forest** is an ensemble of decision trees that builds multiple decision trees by randomising the training samples and the predictors during training. It then averages the individual predictions to get the final prediction. It is chosen for its robustness, ability to handle high-dimensional and imbalanced datasets, and reduced risk of overfitting compared to a single decision tree. Key hyperparameters include:

Number of Trees (n_estimators): The total trees in the forest. <br>
Max Features: The number of features considered for splitting at each node. <br>
Max Depth: Limits the depth of each tree to control complexity and overfitting. <br>
Further parameters will be explored after experimentation with these and explained with code <br>
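
Two ingredients of this idea, bootstrap sampling of the training rows and combining per-tree predictions, can be sketched as follows. The helpers and the per-tree predictions are hypothetical; this sketch uses a simple majority vote, whereas scikit-learn's RandomForestClassifier combines the trees' probability estimates, so treat it as a simplification.

```python
import random
from collections import Counter


def bootstrap_indices(n_samples, rng):
    """Draw n_samples row indices with replacement (one bootstrap sample)."""
    return [rng.randrange(n_samples) for _ in range(n_samples)]


def majority_vote(predictions_per_tree):
    """Combine per-tree label predictions into one label per sample."""
    n_samples = len(predictions_per_tree[0])
    return [
        Counter(tree[i] for tree in predictions_per_tree).most_common(1)[0][0]
        for i in range(n_samples)
    ]


rng = random.Random(22)
sample = bootstrap_indices(10, rng)
print(f"Bootstrap sample of row indices: {sample}")

# Hypothetical predictions from three trees on three validation segments
tree_predictions = [
    [0, 1, 1],
    [0, 0, 1],
    [1, 1, 1],
]
ensemble_prediction = majority_vote(tree_predictions)
print(f"Ensemble prediction: {ensemble_prediction}")
```

Each tree sees a slightly different sample of the data, so their individual errors are less correlated and tend to cancel out when combined.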

# 5 Dataset

The following datasets have been created to build and evaluate the models using the MLEnd deception dataset.

Test Dataset: The provided dataset of 100 audio recordings is split in an 80-20 ratio into train and test sets. The split is stratified to ensure equal representation of both labels in the test set, so that both labels are tested equally. Once done, the test dataset is kept aside for the final testing.

Train Dataset: The training dataset comprises 80 audio recordings with an equal number of true and deceptive labels, i.e. 40 true and 40 deceptive stories, to obtain a balanced dataset. This gives the model the same opportunity to learn from both label categories and avoids biasing the training towards one label at the expense of the other. Segmentation and feature extraction are performed, after which a validation dataset is split off.

Validation Dataset: 20% of the train data is further split off as a validation dataset. The validation split is done after segmentation and feature extraction so that these steps do not have to be repeated for the validation set in every experiment. 
One limitation is that the train and validation sets may contain segments from the same audio file, leading to data leakage and biased predictions. This is avoided by splitting the validation set in groups by file ID. However, because of the added complexity of grouped splitting for each fold, k-fold cross-validation is not performed.
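
For reference, grouped folds are not hard to construct once every segment carries its file ID. The sketch below is a hand-rolled illustration, assuming a hypothetical helper `grouped_folds` and made-up file IDs; scikit-learn also offers `GroupKFold` and `StratifiedGroupKFold` for the same purpose. It was not used in this project.

```python
def grouped_folds(groups, n_splits=5):
    """Yield (train_idx, val_idx) pairs so that no group appears on both sides."""
    unique_groups = sorted(set(groups))
    if n_splits > len(unique_groups):
        raise ValueError("n_splits cannot exceed the number of distinct groups")
    # Assign whole groups (files) to folds in round-robin order
    fold_of_group = {g: i % n_splits for i, g in enumerate(unique_groups)}
    for fold in range(n_splits):
        val_idx = [i for i, g in enumerate(groups) if fold_of_group[g] == fold]
        train_idx = [i for i, g in enumerate(groups) if fold_of_group[g] != fold]
        yield train_idx, val_idx


# Hypothetical FileID column: one entry per segment
file_ids = ["00001"] * 3 + ["00002"] * 2 + ["00003"] * 4 + ["00004"] * 2

folds = list(grouped_folds(file_ids, n_splits=2))
for train_idx, val_idx in folds:
    train_files = {file_ids[i] for i in train_idx}
    val_files = {file_ids[i] for i in val_idx}
    print(f"train files={sorted(train_files)}, validation files={sorted(val_files)}")
```

This simple version does not balance the labels or the fold sizes; a stratified grouped splitter would be needed for that.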

Below we are loading the MLEnd deceptive data and creating the datasets

In [None]:
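import subprocess
import sys
import os

# Function to install packages silently
def install_package(package):
    # Use the running interpreter's pip so packages land in the notebook's environment
    subprocess.run([sys.executable, "-m", "pip", "install", package],
                   stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)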

# Install the packages
install_package("mlend==1.0.0.4")
install_package("pydub")
install_package("librosa")

In [None]:
import sys
import warnings
import re
import pickle
import glob
import urllib.request
import zipfile
from collections import Counter

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import pydub
import librosa
import soundfile as sf
from scipy.io import wavfile
import IPython.display as ipd
from tqdm import tqdm
from sklearn.model_selection import train_test_split, StratifiedKFold, RandomizedSearchCV, GroupShuffleSplit
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import (accuracy_score, precision_score, recall_score, f1_score,
                             classification_report, confusion_matrix, ConfusionMatrixDisplay,
                             roc_auc_score)
import mlend
from mlend import download_deception_small, deception_small_load



In [None]:
# Get the current working directory
current_directory = os.getcwd()

# Define the destination folder
destination_folder = os.path.join(current_directory, "MLEnd")
subdirectory = 'MLEnd/deception'

# Create the directory if it doesn't exist
os.makedirs(destination_folder, exist_ok=True)

#datadir = download_deception_small(save_to=destination_folder, subset={}, verbose=1, overwrite=False) #to download data
datadir = os.path.join(current_directory, subdirectory) #to use already download data in the current directory


In [None]:
#Load the data
DataSet, _, MAPs = deception_small_load(datadir_main=datadir, train_test_split=None, verbose=1, encode_labels=True)  # Load the whole dataset without splitting into train and test sets

# Print the data types of the loaded data
print(f"Datatype of DataSet: {type(DataSet)}")
print(f"Datatype of MAPs: {type(MAPs)}")

In [None]:
warnings.filterwarnings("ignore", category=ImportWarning)

pd.set_option('display.max_colwidth', None)

# Converting the dictionary to a pandas DataFrame
DataSet = pd.DataFrame(DataSet).sort_values(by='X_paths', ascending=True)
DataSet.head()

The provided dataset contains the following details <br>
'X_paths': corresponds to the location of the audio file<br>
'Y': story type (true or deceptive)<br>
'Y_encoded': binary label for the story type.

## 5.1 Test Dataset

Building the Test dataset:

In [None]:
# Splitting the dataset in train and test sets before building the pipeline to prevent any data leakages

# Separate labels for a stratified split
X = DataSet.drop(columns=["Y_encoded"])
y = DataSet["Y_encoded"]

# Perform a stratified split for a balanced train and test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=22)

# Combine X and y back into DataFrames
TrainData = pd.concat([X_train, y_train], axis=1)
TestData = pd.concat([X_test, y_test], axis=1)

# Resetting the index for both train and test sets
#TrainData = TrainData.reset_index(drop=True)
#TestData = TestData.reset_index(drop=True)

# No of samples in TrainData
train_counts = Counter(TrainData['Y_encoded'])
print(f"TrainData - Number of true stories: {train_counts[0]}, Number of deceptive stories: {train_counts[1]}")

# No of samples in TestData
test_counts = Counter(TestData['Y_encoded'])
print(f"TestData - Number of true stories: {test_counts[0]}, Number of deceptive stories: {test_counts[1]}")


## 5.2 Train Dataset
After the split, TrainData comprises 80 samples, which we can view below.

In [None]:
# Display the DataFrame
TrainData.head()

In [None]:
# Analysing the first 5 audio files in TrainData
for idx, row in TrainData.head(5).iterrows():  # iterating directly over the first 5 rows
    audio_path = row['X_paths']  # Accessing the path from the row
    print(f"Audio file path: {audio_path}")
    
    # Accessing the label
    audio_label = row['Y']
    
    # Display the audio
    ipd.display(ipd.Audio(audio_path))
    
    # Displays the file name and label of the played audio
    file_name = os.path.basename(audio_path)  # works with any path separator
    print(f"File name of the played audio: {file_name}")
    print(f"Label of the played audio: {audio_label}\n")

As can be seen, the audio recordings are longer than the required 30 seconds. We therefore use the split_audio_files function to extract 30-second segments with a 50% overlap and save them in a separate folder named 'train_segments'.

In [None]:
train_folder = os.path.join(current_directory, "train_segments")

# Split and save segments for the training data
TrainSegments = split_audio_files(TrainData, train_folder)

In [None]:
#Display the resulting DataFrame
print("\nSegmented DataFrame:")
TrainSegments.iloc[:, 0:5]

In [None]:
#Check the balance of dataset

# Count of true and deceptive stories
counts = TrainSegments['Y_encoded'].value_counts()
count_ones = counts[1] if 1 in counts else 0
count_zeros = counts[0] if 0 in counts else 0

# Total count
total_count = count_ones + count_zeros

# Calculate Percentages
percentage_ones = (count_ones / total_count) * 100 if total_count > 0 else 0
percentage_zeros = (count_zeros / total_count) * 100 if total_count > 0 else 0

print(f"\nTotal segments: {total_count}")
print(f"Number of segments of true stories: {count_zeros} ({percentage_zeros:.2f}%)")
print(f"Number of segments deceptive stories: {count_ones} ({percentage_ones:.2f}%)")

The class distribution is close to 50:50, so class imbalance is not a significant issue.

In [None]:
# Extract features and create a new DataFrame
TrainSegments_with_features = extract_features(TrainSegments, scale_audio=True)

# Define the CSV file path in the current working directory
csv_path = os.path.join(current_directory, 'TrainSegments_with_features.csv')

# Save the new DataFrame to a CSV file
TrainSegments_with_features.to_csv(csv_path, index=False)
#print(f"DataFrame saved to {csv_path}")

# Read the CSV file from the current working directory
TrainSegments_with_features = pd.read_csv(csv_path)

# Display the new DataFrame
TrainSegments_with_features.head()


In [None]:
# Selecting the features for the pairplot
features = TrainSegments_with_features.iloc[:, 5:9].copy()  # copy so that adding a column does not modify a view

# Adding the label column
features['Y_encoded'] = TrainSegments_with_features['Y_encoded']

# Creating the pairplot
sns.pairplot(features, hue='Y_encoded')


plt.show()


The plots show significant overlap between the two classes for all features; however, there are some slightly denser clusters for each label, which the model will hopefully pick up. In the histograms on the diagonal, the true label has higher counts than the deceptive label for each feature because of the slight imbalance introduced by splitting the recordings into segments.

## 5.3 Validation Dataset


In [None]:
#Splitting the data in validation set

# Extract features, labels, and FileID column
X = TrainSegments_with_features[['power', 'pitch_mean', 'pitch_std', 'voiced_fr']]
y = TrainSegments_with_features['Y_encoded']
groups = TrainSegments_with_features['FileID']  # Group by FileID to maintain separation

# Initialize GroupShuffleSplit
group_splitter = GroupShuffleSplit(test_size=0.2, random_state=22)

# Split data into train and validation sets based on FileID
train_idx, val_idx = next(group_splitter.split(X, y, groups=groups))

# Create training and validation sets
X_train, X_val = X.iloc[train_idx], X.iloc[val_idx]
y_train, y_val = y.iloc[train_idx], y.iloc[val_idx]

# # Extract FileID for verification
# train_file_ids = TrainSegments_with_features.iloc[train_idx]['FileID'].unique()
# val_file_ids = TrainSegments_with_features.iloc[val_idx]['FileID'].unique()

# Check if there's any overlap between the train and validation FileIDs
# overlap = set(train_file_ids).intersection(set(val_file_ids))

# # Print verification result
# if len(overlap) == 0:
#     print("The train and validation sets contain data from different FileIDs.")
# else:
#     print(f"Overlap found between train and validation sets for FileIDs: {overlap}")


# 6 Experiments and results

The transformed data of the train dataset is fed to the models, and the validation data is used to evaluate them.

In [None]:
def plot_confusion_matrices(y_true, y_pred):
    """
    Generate and plot confusion matrices: one with counts and one with ratios.

    Parameters:
    y_true (array-like): True labels.
    y_pred (array-like): Predicted labels.
    """
    # Generate the confusion matrix
    conf_matrix = confusion_matrix(y_true, y_pred)

    # Calculate the ratio matrix (normalise each column by its sum, i.e. per predicted class)
    conf_matrix_ratio = conf_matrix.astype('float') / conf_matrix.sum(axis=0)

    # Subplots
    fig, axes = plt.subplots(1, 2, figsize=(14, 6))

    # Plot the count confusion matrix (Actual as columns, Predicted as rows)
    sns.heatmap(conf_matrix.T, annot=True, fmt='d', cmap='Blues', 
                xticklabels=['True (0)', 'Deceptive (1)'], 
                yticklabels=['True (0)', 'Deceptive (1)'], 
                ax=axes[0])
    axes[0].set_title('Confusion Matrix (Count)')
    axes[0].set_xlabel('Actual')
    axes[0].set_ylabel('Predicted')

    # Plot the ratio confusion matrix (Actual as columns, Predicted as rows)
    sns.heatmap(conf_matrix_ratio.T, annot=True, fmt='.2f', cmap='Greens', 
                xticklabels=['True (0)', 'Deceptive (1)'], 
                yticklabels=['True (0)', 'Deceptive (1)'], 
                ax=axes[1])
    axes[1].set_title('Confusion Matrix (Ratio)')
    axes[1].set_xlabel('Actual')
    axes[1].set_ylabel('Predicted')

    # Show the plots
    plt.tight_layout()
    plt.show()
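
Because the ratio matrix above divides each column by its sum, every column adds up to 1 and the diagonal holds the per-class precision. The sketch below, on made-up labels (`y_true_demo` and `y_pred_demo` are hypothetical), suggests that this matches scikit-learn's built-in `normalize="pred"` option.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up labels for illustration only
y_true_demo = np.array([0, 0, 0, 1, 1, 1, 0, 1])
y_pred_demo = np.array([0, 1, 0, 1, 0, 1, 0, 1])

# Rows are actual classes, columns are predicted classes
counts = confusion_matrix(y_true_demo, y_pred_demo)

# Manual column normalization, as in plot_confusion_matrices
ratio_manual = counts.astype(float) / counts.sum(axis=0)

# Built-in equivalent: normalize over the predicted (column) axis
ratio_builtin = confusion_matrix(y_true_demo, y_pred_demo, normalize="pred")

print(ratio_manual)
print(ratio_builtin)
```

Note that the manual division produces NaN values if a class is never predicted, since that column then sums to zero.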

In [None]:
# Normalize features using StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_val_scaled = scaler.transform(X_val) #scale the validation data using the mean and standard deviation values calculated from the training data.

# Support Vector Machine (SVM)

In [None]:

# Train SVM model
svm_model = SVC(kernel='rbf',class_weight='balanced')
svm_model.fit(X_train_scaled, y_train)

# Predict on the validation set
y_pred = svm_model.predict(X_val_scaled)

# Evaluate the model
accuracy = accuracy_score(y_val, y_pred)
print(f"Accuracy: {accuracy:.2f}")
print("\nClassification Report:\n", classification_report(y_val, y_pred))


As can be seen, the model's accuracy is 0.5, i.e. 50%. This is comparable to tossing a coin, where a story is equally likely to be predicted as true or deceptive, so the model needs to be improved.
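
Whether 50% is exactly chance level depends on the class balance of the evaluation data. One way to make the comparison concrete is a trivial baseline that always predicts the most frequent class. The sketch below uses made-up data (`X_demo` and `y_demo` are hypothetical, with an assumed 60/40 class split) and is not part of the original experiments.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Made-up features and labels: 120 samples of class 0, 80 of class 1
rng = np.random.default_rng(22)
X_demo = rng.normal(size=(200, 4))
y_demo = np.array([0] * 120 + [1] * 80)

# Baseline that ignores the features and always predicts the majority class
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_demo, y_demo)

baseline_acc = accuracy_score(y_demo, baseline.predict(X_demo))
print(f"Majority-class baseline accuracy: {baseline_acc:.2f}")
```

A trained model is only useful if it clearly beats this kind of baseline on held-out data.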

In [None]:
# Train SVM model with the hyperparameters given in the starter kit
svm_model = SVC(C=1, gamma=2, kernel='rbf', class_weight='balanced', probability=True)
svm_model.fit(X_train_scaled, y_train)

# Predict on the validation set
y_pred = svm_model.predict(X_val_scaled)

# Evaluate the model
accuracy = accuracy_score(y_val, y_pred)
print(f"Accuracy: {accuracy:.2f}")
print("\nClassification Report:\n", classification_report(y_val, y_pred))
plot_confusion_matrices(y_val, y_pred)


The model accuracy has improved slightly, to 53% overall. However, it performs better on the true class (0) than on the deceptive class (1), as seen in the higher precision (0.59 vs. 0.40) and recall (0.65 vs. 0.35). For the true class, precision and recall are close to each other, so its F1-score reflects a reasonable balance between the two.
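
For reference, the per-class metrics quoted above are derived from the confusion matrix counts. The sketch below works through precision, recall and F1 for class 1 on made-up labels (`y_true_demo` and `y_pred_demo` are hypothetical and unrelated to the notebook's results).

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score

# Made-up labels for illustration only
y_true_demo = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y_pred_demo = np.array([0, 0, 0, 1, 0, 0, 1, 1])

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

precision_1 = tp / (tp + fp)  # share of predicted class 1 that is correct
recall_1 = tp / (tp + fn)     # share of actual class 1 that is found
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

print(f"precision={precision_1:.2f} recall={recall_1:.2f} f1={f1_1:.2f}")
```

The same numbers appear in the class 1 row of `classification_report`.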

# K-Nearest Neighbors (KNN)

In [None]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import accuracy_score, classification_report

# Parameter grid for KNN
param_dist_knn = {
    'n_neighbors': [3, 5, 7, 10],
    'weights': ['uniform', 'distance'],
    'metric': ['euclidean', 'manhattan', 'minkowski']
}

# Initialize KNN model
knn_model = KNeighborsClassifier()

# Initialize RandomizedSearchCV for KNN with 3-fold cross-validation
random_search_knn = RandomizedSearchCV(estimator=knn_model, param_distributions=param_dist_knn, 
                                       n_iter=10, cv=3, n_jobs=-1, random_state=22)

# Fit the RandomizedSearchCV
random_search_knn.fit(X_train_scaled, y_train)

# Get the best parameters and print them
best_params_knn = random_search_knn.best_params_
print(f"\nBest KNN Parameters: {best_params_knn}")

# KNN Model Evaluation
best_knn_model = random_search_knn.best_estimator_
y_pred_knn = best_knn_model.predict(X_val_scaled)

# Evaluate KNN
accuracy_knn = accuracy_score(y_val, y_pred_knn)
print("\nKNN Accuracy:", accuracy_knn)
print("\nKNN Classification Report:\n", classification_report(y_val, y_pred_knn))

plot_confusion_matrices(y_val, y_pred_knn)

The KNN model achieves an accuracy of 49.6%. Like the SVM, it is better at identifying class 0 (precision: 0.57, recall: 0.60) than class 1 (precision: 0.37, recall: 0.35).

In [None]:
# Initialize the Decision Tree model
dt_model = DecisionTreeClassifier(random_state=22)

# Train the Decision Tree model
dt_model.fit(X_train_scaled, y_train)

# Predict on the validation set
y_pred_dt = dt_model.predict(X_val_scaled)

# Evaluate the model
accuracy_dt = accuracy_score(y_val, y_pred_dt)
print("\nDecision Tree Accuracy:", accuracy_dt)
print("\nDecision Tree Classification Report:\n", classification_report(y_val, y_pred_dt))
plot_confusion_matrices(y_val, y_pred_dt)

So far, the Decision Tree model achieves the highest accuracy, at 59.5%. It also performs better on class 0 (precision: 0.65, recall: 0.69) than on class 1 (precision: 0.50, recall: 0.45). However, decision trees tend to overfit, so a Random Forest, which is an ensemble of decision trees, is tried next.
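
The overfitting tendency of a single tree can be illustrated by comparing its accuracy on the data it was trained on with its accuracy on held-out data. The sketch below uses synthetic data from `make_classification` with deliberately noisy labels (all names and settings are assumptions for illustration, not the notebook's data).

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy binary problem with four features (illustrative only)
X_demo, y_demo = make_classification(
    n_samples=400, n_features=4, n_informative=2, n_redundant=0,
    flip_y=0.3, random_state=22,
)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=22
)

# An unconstrained tree keeps splitting until it fits the training set
tree = DecisionTreeClassifier(random_state=22).fit(X_tr, y_tr)

train_acc = tree.score(X_tr, y_tr)
val_acc = tree.score(X_va, y_va)
print(f"train accuracy: {train_acc:.2f}, validation accuracy: {val_acc:.2f}")
```

A large gap between the two numbers is the usual sign that the tree has memorised the training data rather than learned a pattern that transfers.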

# Ensemble

In [None]:
# ---- Random Forest: an ensemble of decision trees.
param_dist_rf = {
    'n_estimators': [50, 100, 200],
    'max_depth': [10, 20, 30, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'bootstrap': [True, False]
}

rf_model = RandomForestClassifier(random_state=22)
random_search_rf = RandomizedSearchCV(estimator=rf_model, param_distributions=param_dist_rf, 
                                      n_iter=10, cv=5, n_jobs=-1, random_state=22)

random_search_rf.fit(X_train_scaled, y_train)

# Random Forest Model Evaluation
best_rf_model = random_search_rf.best_estimator_
y_pred_rf = best_rf_model.predict(X_val_scaled)
accuracy_rf = accuracy_score(y_val, y_pred_rf)
print("\nRandom Forest Accuracy:", accuracy_rf)
print("\nRandom Forest Classification Report:\n", classification_report(y_val, y_pred_rf))


The Random Forest model achieves an accuracy of 50.4%, which is again no better than chance. Next, the randomized search is repeated over a revised hyperparameter space, using ROC-AUC as the scoring metric.

In [None]:
# Define hyperparameter space 
param_dist_rf = {
    'n_estimators': [100, 200],  # Number of trees in the forest
    'max_depth': [10, 20, None],  # Maximum depth of the trees (None means no limit)
    'min_samples_split': [2, 5],  # Minimum samples required to split an internal node
    'min_samples_leaf': [1, 2],   # Minimum samples required to be at a leaf node
    'max_features': ['sqrt', 'log2'],  # Number of features to consider for splitting at each node (sqrt and log2 are common options)
    'bootstrap': [True, False],   # Whether to sample the data with or without replacement (True for bootstrapping)
    'class_weight': [None, 'balanced']  # Weigh classes inversely proportional to class frequencies (useful for imbalanced data)
}

# Sample 50% of the data for quick tuning
X_tune, _, y_tune, _ = train_test_split(X_train_scaled, y_train, test_size=0.5, random_state=22)

# Initialize the Random Forest model
rf_model = RandomForestClassifier(random_state=22)

# RandomizedSearchCV for faster hyperparameter tuning
random_search_rf = RandomizedSearchCV(
    estimator=rf_model, 
    param_distributions=param_dist_rf, 
    n_iter=20,  # Number of combinations to try
    cv=3,       # No of folds for cross-validation
    scoring='roc_auc',  # Use ROC-AUC as evaluation metric
    random_state=22,
    n_jobs=-1,  # Use all CPU cores for parallel processing
    verbose=2   # Display progress
)

# Fit the model using a subset of the training data
random_search_rf.fit(X_tune, y_tune)

# Get the best model and parameters
best_rf_model = random_search_rf.best_estimator_
print("\nBest Parameters:", random_search_rf.best_params_)

# Predict on the validation set
y_pred_rf = best_rf_model.predict(X_val_scaled)
y_proba_rf = best_rf_model.predict_proba(X_val_scaled)[:, 1]

# Evaluate the model
accuracy_rf = accuracy_score(y_val, y_pred_rf)
roc_auc_rf = roc_auc_score(y_val, y_proba_rf)
print("\nImproved Random Forest Accuracy:", accuracy_rf)
print("\nImproved Random Forest ROC-AUC:", roc_auc_rf)
print("\nClassification Report:\n", classification_report(y_val, y_pred_rf))


Other ensembling techniques were also tried: stacking with SVM and KNN as base models and Logistic Regression as the meta-model, voting, and boosting using AdaBoost with SVM and with a Decision Tree as the base estimator. None of them gave a significant improvement, so they are not included in the notebook.
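
Since those experiments are not shown, the following is only a minimal sketch of what a stacking setup of that kind could look like with scikit-learn's `StackingClassifier`. The data is synthetic and every setting (number of neighbours, `cv=3`, the pipelines) is an assumption, not the configuration that was actually run.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic four-feature binary problem (illustrative only)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=4, n_informative=2, n_redundant=0,
    random_state=22,
)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=22
)

# Base models: SVM and KNN, each with its own scaler
base_models = [
    ("svm", make_pipeline(StandardScaler(), SVC(kernel="rbf", class_weight="balanced"))),
    ("knn", make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))),
]

# Meta-model: logistic regression trained on cross-validated base outputs
stack = StackingClassifier(
    estimators=base_models,
    final_estimator=LogisticRegression(),
    cv=3,
)
stack.fit(X_tr, y_tr)

stack_acc = stack.score(X_va, y_va)
print(f"Stacking validation accuracy: {stack_acc:.2f}")
```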

# Final Testing

The final test is run with the tuned Random Forest classifier.

In [None]:

# Run the test data through the transformation stage
test_folder = os.path.join(current_directory, "test_segments")
csv_test_path = os.path.join(current_directory, 'TestSegments_with_features.csv')

TestSegments = split_audio_files(TestData, test_folder)
TestSegments_with_features = extract_features(TestSegments, scale_audio=True)

# Save the new DataFrame to a CSV file
TestSegments_with_features.to_csv(csv_test_path, index=False)

# Read the CSV file from the current working directory
TestSegments_with_features = pd.read_csv(csv_test_path)
TestSegments_with_features

In [None]:
X_test = TestSegments_with_features[['power', 'pitch_mean', 'pitch_std', 'voiced_fr']]
y_test = TestSegments_with_features['Y_encoded']

# Standardize the features with the scaler fitted on the training data
# (assumes X_test has the same feature columns as X_train)
X_test_scaled = scaler.transform(X_test)


# Predict with the trained random forest classifier
y_pred_rf_test = best_rf_model.predict(X_test_scaled)


# Evaluate the test model
accuracy_rf_test = accuracy_score(y_test, y_pred_rf_test)
print("\n Deception Model Accuracy using Random Forest:", accuracy_rf_test)
print("\nClassification Report:\n", classification_report(y_test, y_pred_rf_test))
plot_confusion_matrices(y_test, y_pred_rf_test)
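
The test cell calls for the same scaling as during training. A minimal sketch of why this matters: a scaler fitted on the training data preserves any shift between training and test features, whereas a scaler refitted on the test data removes it, so the model would see inputs on a different footing than it was trained on. The arrays below are made up and the shift of 2.0 is an arbitrary assumption.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up features: the test set is shifted relative to the training set
rng = np.random.default_rng(22)
X_train_demo = rng.normal(loc=0.0, scale=1.0, size=(100, 4))
X_test_demo = rng.normal(loc=2.0, scale=1.0, size=(30, 4))

# Option 1: reuse the statistics learned from the training data
scaler_demo = StandardScaler().fit(X_train_demo)
X_test_reused = scaler_demo.transform(X_test_demo)

# Option 2: refit on the test data, which recentres it on zero
X_test_refit = StandardScaler().fit_transform(X_test_demo)

print("mean with training scaler:", X_test_reused.mean(axis=0).round(2))
print("mean with refitted scaler:", X_test_refit.mean(axis=0).round(2))
```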

# 7 Conclusions

The final evaluation was carried out on the reserved, unseen test data. The model reached a low accuracy of 34.3%, with similarly weak performance for both classes (class 0: precision 0.37, recall 0.32; class 1: precision 0.32, recall 0.37). The F1-scores of 0.34 for both classes indicate that the model struggles to differentiate between them.

Conclusion
- The model does not generalise well: performance was slightly better on the training and validation sets, but dropped considerably on the unseen test set, to an accuracy of about 0.34. Applying a logical NOT to the predictions would give better results on this test set.
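
The remark about inverting the predictions follows from simple arithmetic: for binary labels, every wrong prediction becomes right when flipped, so the inverted accuracy is one minus the original accuracy. The sketch below shows this on made-up labels (`y_true_demo` and `y_pred_demo` are hypothetical). Choosing to invert after seeing the test labels would no longer give an unbiased estimate of performance.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Made-up binary labels and predictions for illustration only
y_true_demo = np.array([0, 1, 1, 0, 1, 0, 0, 1, 1, 0])
y_pred_demo = np.array([1, 0, 1, 1, 0, 1, 0, 0, 0, 1])

acc = accuracy_score(y_true_demo, y_pred_demo)

# Logical NOT on 0/1 predictions
acc_inverted = accuracy_score(y_true_demo, 1 - y_pred_demo)

print(f"accuracy: {acc:.2f}, inverted accuracy: {acc_inverted:.2f}")
```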

Suggestions for improvements:
- Enhanced features: Include additional audio features such as speech rate and Mel-frequency cepstral coefficients (MFCCs).
- Advanced models: Use deep neural networks to capture complex patterns, and apply transfer learning with pretrained models such as DeepSpeech or Wav2Vec 2.0.
- Cross-validation: Use K-fold cross-validation to obtain a more robust estimate of model performance.
- Hyperparameter optimization: Run grid search or random search over a wider range of values to identify better parameters for the model.
- More data: Increase the amount of training data.
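
As an illustration of the cross-validation suggestion, the sketch below combines K-fold evaluation with the file-level grouping used for the train/validation split, so that segments of one recording never appear in both the training and the evaluation fold. It runs on made-up data; the names, sizes and model settings are assumptions rather than the notebook's configuration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GroupKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Made-up data: 20 files with 5 segments each, one label per file
rng = np.random.default_rng(22)
n_files, n_segments = 20, 5
groups_demo = np.repeat(np.arange(n_files), n_segments)
X_demo = rng.normal(size=(n_files * n_segments, 4))
y_demo = np.repeat(np.arange(n_files) % 2, n_segments)

# Scaling sits inside the pipeline, so it is refitted on each training fold
model = make_pipeline(
    StandardScaler(),
    RandomForestClassifier(n_estimators=50, random_state=22),
)

# Each fold holds out whole files
cv = GroupKFold(n_splits=5)
scores = cross_val_score(
    model, X_demo, y_demo, groups=groups_demo, cv=cv, scoring="accuracy"
)
print("Fold accuracies:", scores.round(2))
print(f"Mean accuracy: {scores.mean():.2f}")
```

Reporting the mean and spread over folds gives a steadier picture than a single validation split, which matters with a dataset this small.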


# 8 References

Articles:

https://towardsdatascience.com/understanding-audio-data-fourier-transform-fft-spectrogram-and-speech-recognition-a4072d228520

https://medium.com/analytics-vidhya/audio-data-processing-feature-extraction-science-concepts-behind-them-be97fbd587d8

https://medium.com/analytics-vidhya/audio-data-processing-feature-extraction-essential-science-concepts-behind-them-part-2-9c738e6a7f99

https://daehnhardt.com/blog/2023/03/05/python-audio-signal-processing-with-librosa/

https://wiki.cci.arts.ac.uk/books/how-to-guides/page/audio-files-with-librosa

Repositories:

https://github.com/alicex2020/Deep-Learning-Lie-Detection

https://github.com/m-shahbaz-kharal/py_lie_detect

https://github.com/craigfrancis/audio-detect

https://github.com/librosa/librosa

https://github.com/hernanrazo/human-voice-detection

https://github.com/Ribin-Baby/Audio-Processing

Libraries:

Python Software Foundation. (n.d.). Python Language Reference. Retrieved from https://www.python.org/

NumPy Developers. (2024). NumPy: The fundamental package for scientific computing with Python. Retrieved from https://numpy.org/

Pandas Development Team. (2024). Pandas: Python Data Analysis Library. Retrieved from https://pandas.pydata.org/

Hunter, J. D. (2007). Matplotlib: A 2D graphics environment. Computing in Science & Engineering, 9(3), 90-95. DOI: 10.1109/MCSE.2007.55

Waskom, M. L. (2021). Seaborn: statistical data visualization. Retrieved from https://seaborn.pydata.org/

PyDub Contributors. (2024). PyDub: Python library for audio processing. Retrieved from https://pydub.com/

Librosa Contributors. (2024). Librosa: Python package for music and audio analysis. Retrieved from https://librosa.org/

Soundfile Contributors. (2024). Soundfile: Read and write sound files. Retrieved from https://pysoundfile.readthedocs.io/en/latest/

SciPy Contributors. (2024). SciPy: Open source scientific tools for Python. Retrieved from https://scipy.org/

IPython Development Team. (2024). IPython: A rich interactive environment for Python. Retrieved from https://ipython.org/

TQDM Contributors. (2024). TQDM: A fast, extensible progress bar for loops and other operations. Retrieved from https://tqdm.github.io/

Scikit-learn Contributors. (2024). Scikit-learn: Machine Learning in Python. Retrieved from https://scikit-learn.org/

MLend Contributors. (2024). MLend: A library for machine learning in deception detection. Retrieved from https://mlend.readthedocs.io/



