
# **1 Author**

 Student Name: Arun Govardhan Raj Obuli

 Student ID: 241024544

# **2 Problem formulation**

The machine learning problem that we are trying to solve here is to identify deceptive stories. Our aim is to classify audio recordings of stories as deceptive or truthful based on features such as pitch, loudness, Mel-Frequency Cepstral Coefficients (MFCCs), Zero Crossing Rate (ZCR), spectral centroid, and spectral bandwidth.

This is an intriguing problem because deceptive speech may exhibit subtle variations in these audio features, which can be captured and analyzed. A good deception detection model could be useful in fields like law enforcement, where it could assist in detecting lies during interrogations, and in psychological research, where understanding deception can provide insights into human behavior.

# **3 Methodology**

 **Training task**

First, we extract features from the audio recordings: pitch, loudness, Mel-Frequency Cepstral Coefficients (MFCCs), zero crossing rate, spectral centroid, and spectral bandwidth. Prior studies (see Section 8) report that such acoustic features can differ between deceptive and truthful speech. Next, we preprocess the features by standardization followed by dimensionality reduction with PCA.

We then train a Support Vector Machine (SVM) model on the preprocessed features. The SVM is configured with class_weight='balanced', which weights each class inversely to its frequency so that errors on the minority class are penalized more heavily.
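To make the effect of class_weight='balanced' concrete, the sketch below reproduces the weighting rule documented by scikit-learn, `n_samples / (n_classes * class_count)`. The helper name `balanced_class_weights` and the class counts are hypothetical; they are not the counts of our training set.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Return {class: weight} using n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * n) for cls, n in counts.items()}


# Hypothetical label vector: 300 deceptive (False) and 100 truthful (True) segments.
example_labels = [False] * 300 + [True] * 100
weights = balanced_class_weights(example_labels)
print(weights)
```

With these illustrative counts the minority class receives three times the weight of the majority class, so a misclassified minority segment costs the SVM three times as much during training.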

**Test task**

The test task is performed on an independent test set that is kept separate from the training set, so that it estimates how well the model generalizes. The test set is created by splitting the dataset with GroupShuffleSplit, using the story (audio file) as the group, so that no story contributes segments to both the training and test sets. This prevents data leakage through segments of the same recording appearing on both sides of the split.
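The grouping behaviour can be checked on toy data. The following is a minimal sketch, assuming 10 hypothetical stories with 4 segments each; the file names, features, and labels are invented for illustration only.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical toy data: 10 stories with 4 segments each.
groups = np.repeat([f"story_{i:02d}.wav" for i in range(10)], 4)
X_toy = np.arange(len(groups) * 2, dtype=float).reshape(len(groups), 2)
y_toy = np.array([i % 2 == 0 for i in range(10)]).repeat(4)

# Split by story, not by segment.
gss = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(gss.split(X_toy, y_toy, groups=groups))

train_stories = set(groups[train_idx])
test_stories = set(groups[test_idx])
shared = train_stories & test_stories  # stories present on both sides of the split
print(len(train_stories), len(test_stories), shared)
```

Because the split is made over groups, `shared` is empty: every story sits entirely in one of the two sets.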

**Performance Metrics**

1. Accuracy - This provides the percentage of correct predictions.
2. Classification Report - Gives the precision, recall, F1-score, and support for each class.

**Additional tasks**

1. We split the audio files into overlapping 30-second segments.
2. We use GroupShuffleSplit to prevent data leakage (i.e. to make sure that segments of the same story do not appear in both the training and test sets).

# **4 Implemented ML prediction pipelines**

**Overview**

We take the raw audio files and labels and perform segmentation, feature extraction, standardization, dimensionality reduction, and a train/test split that keeps all segments of a story in either the training set or the test set. We then train an SVM classifier with class weighting to handle the imbalanced data. The trained model is used to make predictions on the test data, from which we calculate accuracy, precision, recall, and F1-score.

# 4.1 Transformation stage

**Input**: Audio file paths (list of strings) and labels (pandas DataFrame).

In this stage, we load each audio file at the given sampling rate and skip it if it is shorter than the minimum duration. We then compute the segment length and the step size from the overlap percentage and generate the segments with a sliding window. Each segment is 30 seconds long and overlaps the previous segment by 50%, so a new segment starts every 15 seconds. Only full-length segments are kept, so any remainder at the end of a recording that does not fill a complete window is dropped; with a 15-second step this remainder is always shorter than 15 seconds, compared with up to 30 seconds without overlap. Overlapping segments can also help feature extraction, since they can capture shifts in spectral features or energy that fall on the boundary between non-overlapping segments.
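The sliding-window arithmetic can be sketched without loading any audio. The helper `segment_starts` below is a hypothetical name; it mirrors the `range()` used in `split_audio`, and the 100-second recording is an invented example.

```python
def segment_starts(n_samples, sr=22050, seg_dur=30, overlap=0.5):
    """Start indices of full-length windows, mirroring the range() in split_audio."""
    seg_len = int(seg_dur * sr)
    step_size = int(seg_len * (1 - overlap))
    return list(range(0, n_samples - seg_len + 1, step_size)), seg_len


# Hypothetical 100-second recording sampled at 22,050 Hz.
sr = 22050
starts, seg_len = segment_starts(100 * sr, sr=sr)
start_times = [s / sr for s in starts]          # start of each segment in seconds
covered_until = (starts[-1] + seg_len) / sr     # end of the last full segment in seconds
print(start_times)
print(covered_until)
```

For this recording the windows start at 0, 15, 30, 45, and 60 seconds and the last one ends at 90 seconds, so the final 10 seconds do not appear in any segment.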

We extract features from the audio rather than using the raw waveform because a compact feature representation reduces the complexity of the learning problem and is less sensitive to noise.

1. **Pitch**: Pitch was chosen as a feature because a raised pitch may be an indicator of deception. A mean pitch well above the speaker's usual level may indicate stress, which has been associated with lying. High variability in pitch can indicate nervousness, so we also compute the standard deviation of the pitch. In addition, we record the fraction of voiced frames in each segment.

2. **Intensity**: Higher vocal intensity (loudness) may be an indicator of deception, so we extract the mean RMS energy of the audio.

3. **MFCC**: Stress or anxiety during deceptive speech can change the configuration of the vocal tract. Mel-Frequency Cepstral Coefficients (MFCCs) can represent these changes by modelling the frequency content of the speech. We calculate the mean and standard deviation of each MFCC over a segment to summarize the segment's spectral shape and its variability over time.

4. **Zero crossing rate**: The zero crossing rate (ZCR) can provide insights into frequency content. We calculate the mean ZCR of each segment; a higher mean may reflect stress, nervousness, or unintentional articulation changes during deception.

5. **Spectral centroid and bandwidth**: We also extract the mean and standard deviation of the spectral centroid and spectral bandwidth, which describe how energy is distributed across frequencies and could change with the emotional states associated with deception.

All of the above features are extracted for every segment and combined into one feature vector per segment. The corresponding label and story ID are stored alongside each vector.
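The layout of one feature vector can be written out explicitly. In the sketch below the feature names are hypothetical labels chosen for readability; the order and the counts follow the `xi` list built in `getXy_multiple_segments` with the default of 13 MFCCs.

```python
N_MFCC = 13  # default n_mfcc in getMFCC

# Hypothetical names, listed in the same order as the xi vector.
feature_names = (
    ["pitch_mean", "pitch_std", "voiced_fraction", "intensity"]
    + [f"mfcc_mean_{i}" for i in range(N_MFCC)]
    + [f"mfcc_std_{i}" for i in range(N_MFCC)]
    + ["spectral_centroid_mean", "spectral_centroid_std",
       "spectral_bandwidth_mean", "spectral_bandwidth_std"]
    + ["zero_crossing_rate"]
)
print(len(feature_names))
```

Each segment is therefore described by 4 + 13 + 13 + 4 + 1 = 35 values before standardization and PCA.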

This feature matrix is passed to a standard scaler, which standardizes each feature to zero mean and unit variance so that features measured in different units contribute on a comparable scale. PCA is then applied to the standardized feature matrix to reduce its dimensionality by keeping the principal components that capture the most variance. Reducing the number of input dimensions can help limit overfitting.
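In this notebook `transform_features` is called on the full feature matrix before the train/test split, so the scaler and PCA statistics are computed from test segments as well. A possible variant, shown here only as a sketch on random stand-in data, fits both steps on the training rows and reuses them for the test rows. The array sizes are hypothetical, and 6 components matches the value used in Section 6.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical stand-ins for 35-dimensional training and test feature matrices.
X_train_raw = rng.normal(size=(80, 35))
X_test_raw = rng.normal(size=(20, 35))

preprocess = make_pipeline(StandardScaler(), PCA(n_components=6, random_state=42))
X_train_red = preprocess.fit_transform(X_train_raw)  # statistics learned from training rows only
X_test_red = preprocess.transform(X_test_raw)        # test rows reuse those statistics

explained = preprocess.named_steps["pca"].explained_variance_ratio_.sum()
print(X_train_red.shape, X_test_red.shape)
print(f"Variance retained by 6 components: {explained:.2f}")
```

Applying this ordering to the real data would mean splitting the raw feature matrix first; the results reported in Section 6 were obtained with the original ordering.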

**Output**: A standardized, reduced-dimension feature matrix.


# 4.2 Model stage

**Input**: Transformed Feature matrix from the transformation stage.


We chose the SVM for this pipeline after also trying random forest and gradient boosting. In our trials those models tended to memorize the training data and overfit, which we attribute to their high complexity relative to the dataset size. We use the SVM classifier with the RBF (Radial Basis Function) kernel, which can learn a non-linear decision boundary that separates the classes. SVMs also cope well with high-dimensional features, because the decision function depends only on the support vectors rather than on the entire dataset. Because only the support vectors define the boundary, an RBF-kernel SVM can be less sensitive to points far from it than models such as k-Nearest Neighbors or linear regression. SVMs also allow class weighting, which helps represent the minority class better. For these reasons, we chose the SVM with RBF kernel as the model in this project.
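The RBF kernel measures similarity between two feature vectors as K(a, b) = exp(-gamma * ||a - b||^2). The sketch below evaluates it on hypothetical 2-D points with an arbitrarily chosen gamma; `train_model` does not set gamma and so uses scikit-learn's default.

```python
import math


def rbf_kernel(a, b, gamma):
    """K(a, b) = exp(-gamma * ||a - b||^2) for two equal-length vectors."""
    sq_dist = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-gamma * sq_dist)


# Hypothetical points; gamma is chosen arbitrarily for illustration.
gamma = 0.5
k_same = rbf_kernel([0.0, 0.0], [0.0, 0.0], gamma)  # identical points
k_near = rbf_kernel([0.0, 0.0], [1.0, 1.0], gamma)  # nearby points
k_far = rbf_kernel([0.0, 0.0], [3.0, 4.0], gamma)   # distant points
print(k_same, k_near, k_far)
```

The similarity is 1 for identical points and decays towards 0 with distance, which is what lets the SVM draw curved boundaries around local clusters of segments.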

**Output**: Predictions for the data and evaluation metrics.

# 5 Dataset
We split the data into a training set and a test set to fit and evaluate our model.

Because each audio file is split into multiple segments, all segments of the same audio file must be kept together in either the training set or the test set to prevent data leakage, so we use GroupShuffleSplit. If the segments were split at random, segments of the same story could end up in both sets, and the test metrics would overestimate performance on unseen stories. To handle class imbalance we set class_weight='balanced' in the SVM, which reweights the classes during training instead of resampling the data.

**Training Dataset**

We use the training indices from GroupShuffleSplit to train our model. The training set consists of all segments from approximately 80% of the stories in the MLEnd dataset, so segments from the same audio file stay together. The training set is used to fit the model and learn patterns from the data.

**Test Dataset**

We use the test set to evaluate the performance of our model. It consists of all segments from the remaining 20% of the stories, none of which contribute segments to the training set. This gives an estimate of how well the model generalizes to unseen stories.

**Limitations**

1. The approach depends on the labels: we assume they are accurate when balancing the classes. If some labels are incorrect, class weighting can amplify their negative impact.

2. The class weighting might work well on the training data but fail to generalize to unseen test data, especially when the test data has a different class distribution.

3. During segmentation we overlap the segments, which introduces redundancy because adjacent segments share half of their audio and therefore have similar features.

4. The classes are imbalanced, which might lead the model to favor the majority class.

5. The dataset is small, which is not enough to train a complex model and might lead to overfitting.

In [None]:
!pip install mlend==1.0.0.4

import mlend
from mlend import download_deception_small, deception_small_load
import pandas as pd
from google.colab import drive
import glob
from tqdm import tqdm
from sklearn.model_selection import GroupShuffleSplit
import librosa
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.svm import SVC
from sklearn.metrics import classification_report, accuracy_score
import warnings
warnings.filterwarnings("ignore", category=ImportWarning)

drive.mount('/content/drive')
datadir = download_deception_small(save_to='/content/drive/MyDrive/Data/MLEndDeception', subset={}, verbose=1, overwrite=False)
audio_files = glob.glob('/content/drive/MyDrive/Data/MLEndDeception/deception/MLEndDD_stories_small/*')
labels_file = pd.read_csv('MLEndDD_story_attributes_small.csv').set_index('filename')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
Downloading 100 stories (audio files) from https://github.com/MLEndDatasets/Deception
100%|[92m▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓[0m|100\100|00100.wav
Done!


In [None]:
#Transformation Stage

def split_audio(file_path, seg_dur=30, sr=22050, overlap=0.5, min_dur=30):
    x, fs = librosa.load(file_path, sr=sr)
    if len(x) < min_dur * sr: # Skip the file if it contains fewer samples than the minimum duration requires.
        print(f"Skipping {file_path}: File too short ({len(x) / sr:.2f} seconds)")
        return [], fs
    seg_len = int(seg_dur * sr) # The segment duration is 30 seconds as defined in the lecture. Segment length calculates the number of samples per segment
    step_size = int(seg_len * (1 - overlap)) # The number of samples to move forward in the audio signal for each segment when splitting it into overlapping segments.
    segments = [x[start:start + seg_len]
                for start in range(0, len(x) - seg_len + 1, step_size)] # Starting at 0 and advancing by step_size while a full segment of seg_len samples still fits, we slice the signal from start to start + seg_len.
    return segments, fs

def getPitch(x, fs, winLen=0.02): # Input audio(x), sampling rate of the audio signal and the window length for pitch detection. 0.02 seconds is the default size for speech samples.
    frame_length = int(winLen * fs) # Calculating the number of samples in one frame.
    hop_length = frame_length // 2 # Defining the number of samples by which the window is shifted for the next frame.
    f0, voiced_flag, _ = librosa.pyin(y=x, fmin=80, fmax=450, sr=fs, frame_length=frame_length, hop_length=hop_length) # Finding the fundamental frequency for each frame and determining whether it is voiced or unvoiced. fmin and fmax are the min and max pitch expected for human speech.
    return f0, voiced_flag

def getIntensity(x):
    return np.mean(librosa.feature.rms(y=x)) # Calculating average RMS energy across all frames in the audio signal, representing the overall intensity of the audio.

def getMFCC(x, fs, n_mfcc=13): # n_mfcc is the number of MFCC coefficients to extract.
    mfcc = librosa.feature.mfcc(y=x, sr=fs, n_mfcc=n_mfcc) # We extract the mfcc features.
    return np.mean(mfcc, axis=1), np.std(mfcc, axis=1) # Calculating the mean and standard deviation of the features and representing them as 1D arrays.

def getZeroCrossingRate(x):
    return np.mean(librosa.feature.zero_crossing_rate(y=x)) # We extract the ZCR and calculate its mean.

def getSpectralFeatures(x, fs):
    spectral_centroid = librosa.feature.spectral_centroid(y=x, sr=fs) # We extract the spectral centroid.
    spectral_bandwidth = librosa.feature.spectral_bandwidth(y=x, sr=fs) # We extract the spectral bandwidth.
    return np.mean(spectral_centroid), np.std(spectral_centroid), np.mean(spectral_bandwidth), np.std(spectral_bandwidth) # Calculate the mean and standard deviation of both the features.

# Function to call the above functions and calculate the different features for each and every segment.
def getXy_multiple_segments(files, labels_file, seg_dur=30, overlap=0.5, min_dur=30, scale_audio=True):
    X, y, story_ids = [], [], [] # Initializing the outputs. X will store the list of feature vectors for all segments, y stores corresponding labels and story_ids store the file ID to track which file the segment came from.
    for file in tqdm(files): # Iterating through the files.
        fileID = file.split('/')[-1]
        if fileID not in labels_file.index: # Extracting the file ID and checking if it exists in labels file.
            print(f"Skipping {fileID}: Not found in labels file") # Skipping if not present in labels file.
            continue
        yi = labels_file.loc[fileID, 'Story_type'] == 'true_story' # We retrieve the label for a file and convert it to a boolean value (True/false)
        try:
            segments, fs = split_audio(file, seg_dur=seg_dur, overlap=overlap, min_dur=min_dur) # We split the audio into overlapping segments by calling the split_audio function.
            for segment in segments:
                if scale_audio:
                    segment = segment / np.max(np.abs(segment)) # We normalise the segment by its peak amplitude to scale it to the range [-1, 1].
                # Extracting features
                f0, voiced_flag = getPitch(segment, fs, winLen=0.02)
                pitch_mean = np.nanmean(f0) if np.mean(np.isnan(f0)) < 1 else 0
                pitch_std = np.nanstd(f0) if np.mean(np.isnan(f0)) < 1 else 0
                voiced_fr = np.mean(voiced_flag) if voiced_flag is not None else 0
                intensity = getIntensity(segment)
                mfcc_mean, mfcc_std = getMFCC(segment, fs)
                spectral_centroid_mean, spectral_centroid_std, spectral_bandwidth_mean, spectral_bandwidth_std = getSpectralFeatures(segment, fs)
                zero_crossing_rate = getZeroCrossingRate(segment)

                xi = [pitch_mean, pitch_std, voiced_fr, intensity,
                      *mfcc_mean, *mfcc_std,
                      spectral_centroid_mean, spectral_centroid_std, spectral_bandwidth_mean, spectral_bandwidth_std,
                      zero_crossing_rate] # We build the feature vector here by combining all the extracted features for the segment.

                if any(np.isnan(xi)) or any(np.isinf(xi)): # Skipping the segment if any feature is NaN or infinite.
                    continue
                X.append(xi) # We append the feature vector to the main list.
                y.append(yi) # We append the labels to the main list.
                story_ids.append(fileID) # We append the story_ids to the main list.
        except Exception as e:
            print(f"Error processing {fileID}: {e}")
    return np.array(X), np.array(y), story_ids

def transform_features(X, n_components=10):
    # Standardizing the features.
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)

    # Applying PCA for dimensionality reduction.
    pca = PCA(n_components=n_components, random_state=42)
    X_pca = pca.fit_transform(X_scaled)

    return X_pca, scaler, pca  # Return fitted scaler and PCA for use on test data.

In [None]:
# Dataset Stage

def split_dataset(X, y, story_ids, test_size=0.2, random_state=42):
    gss = GroupShuffleSplit(test_size=test_size, n_splits=1, random_state=random_state) # Defining the GroupShuffleSplit with a test size of 20%. n_splits is the number of reshuffling and splitting iterations; here it is 1.
    for train_idx, test_idx in gss.split(X, y, groups=story_ids): # We generate the indices for the training and test sets while respecting the grouping defined by story_ids.
        X_train, X_test = X[train_idx], X[test_idx] # We split the feature matrix into training and test subsets using the indices generated by GroupShuffleSplit.
        y_train, y_test = y[train_idx], y[test_idx] # We split the labels into training and test subsets using the indices generated by GroupShuffleSplit.
    return X_train, X_test, y_train, y_test

In [None]:
# Model Stage

def train_model(X_train, y_train):
    svm_model = SVC(kernel='rbf', class_weight='balanced', random_state=42) # Defining the SVM with rbf kernel and to balance the classes.
    svm_model.fit(X_train, y_train) # We fit the training data into the model.
    return svm_model

def evaluate_model(model, X_test, y_test, X_train, y_train):
    y_train_pred = model.predict(X_train) # Predicting whether each training segment is true or deceptive.
    train_accuracy = accuracy_score(y_train, y_train_pred) # Calculating the training accuracy.
    y_pred = model.predict(X_test) # Performing prediction on the test data.
    accuracy = accuracy_score(y_test, y_pred) # Calculating the test accuracy.
    report = classification_report(y_test, y_pred) # Calculating the performance metrics.
    return train_accuracy, accuracy, report


# 6 Experiments and results

In [None]:
# Feature extraction
X, y, story_ids = getXy_multiple_segments(audio_files, labels_file)

# Transformation
X_transformed, scaler, pca = transform_features(X, n_components=6) # We have tried different n_components and have determined that 6 principal components prevent the model from overfitting to the training data and improve the test accuracy.

# Dataset splitting
X_train, X_test, y_train, y_test = split_dataset(X_transformed, y, story_ids)

# Model training
svm_model = train_model(X_train, y_train)

# Model evaluation
train_accuracy, accuracy, report = evaluate_model(svm_model, X_test, y_test, X_train, y_train)

# Print results
print(f"Train Accuracy: {train_accuracy:.4f}")
print(f"Test Accuracy: {accuracy:.4f}")
print(f"Classification Report:\n{report}")

100%|██████████| 100/100 [1:05:02<00:00, 39.03s/it]


Train Accuracy: 0.8208
Test Accuracy: 0.7532
Classification Report:
              precision    recall  f1-score   support

       False       0.80      0.85      0.82       105
        True       0.63      0.55      0.59        49

    accuracy                           0.75       154
   macro avg       0.71      0.70      0.71       154
weighted avg       0.75      0.75      0.75       154



**Training accuracy**:

The model has a training accuracy of 82.08%, which shows that it fits the training data reasonably well. The gap of about 7 percentage points to the test accuracy suggests mild overfitting.

**Test accuracy**:

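The model has a test accuracy of 75.32% on the 154 test segments. For context, always predicting the majority (deceptive) class would score 105/154 ≈ 68.2% on this test set, so the model improves on that baseline by about 7 percentage points.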

**Precision**:

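In the report, the False class corresponds to deceptive stories and the True class to true stories, and each sample is a 30-second segment.

Of the segments that the model predicts as deceptive, 80% are actually deceptive, as indicated by the precision for the False class.

Of the segments that the model predicts as true, 63% are actually true, as indicated by the precision for the True class. This suggests that the model's predictions of the true class are less reliable.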

**Recall**:

The model correctly identifies 85% of the actual deceptive segments, as indicated by the recall for the False class.

The model correctly identifies only 55% of the actual true segments, so nearly half of the true segments are misclassified as deceptive.

**F1 Score**:

The F1-score of 82% for the False class reflects its high precision and recall.

The lower F1-score of 59% for the True class indicates that the model finds it difficult to predict the true class.

**Macro Average**:

The macro average gives the majority and minority classes the same weight, so poor performance on the minority class lowers the macro-averaged score. Here the macro-averaged precision is 71%, recall is 70%, and F1-score is 71%, all below the weighted averages because the model struggles more with the true class.

**Weighted Average**:

The weighted average weights each class's metrics by its number of samples, so it reflects performance on the dataset as it is distributed. Because the majority class carries more weight, it can mask poor performance on the minority class. Here the weighted precision, recall, and F1-score are all 75%, slightly inflated by the larger deceptive class.
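To show how these figures relate to each other, the sketch below recomputes them from a confusion matrix. The four counts are not notebook output: they are reconstructed from the rounded report (supports of 105 and 49, recalls of 0.85 and 0.55) and should be read as an inference that happens to reproduce every reported value to the printed precision.

```python
# Reconstructed counts (an inference from the rounded report, not notebook output).
# Keys are (actual class, predicted class).
conf = {
    ("deceptive", "deceptive"): 89,
    ("deceptive", "truthful"): 16,
    ("truthful", "deceptive"): 22,
    ("truthful", "truthful"): 27,
}
classes = ["deceptive", "truthful"]
total = sum(conf.values())


def precision(c):
    """Share of segments predicted as c that really are c."""
    return conf[(c, c)] / sum(conf[(a, c)] for a in classes)


def recall(c):
    """Share of segments that really are c and were predicted as c."""
    return conf[(c, c)] / sum(conf[(c, p)] for p in classes)


def f1(c):
    p, r = precision(c), recall(c)
    return 2 * p * r / (p + r)


accuracy = sum(conf[(c, c)] for c in classes) / total
majority_baseline = max(sum(conf[(c, p)] for p in classes) for c in classes) / total
macro_f1 = sum(f1(c) for c in classes) / len(classes)

for c in classes:
    print(f"{c}: precision={precision(c):.2f} recall={recall(c):.2f} f1={f1(c):.2f}")
print(f"accuracy={accuracy:.4f} baseline={majority_baseline:.4f} macro_f1={macro_f1:.2f}")
```

Precision divides by a column of the matrix (everything predicted as a class) while recall divides by a row (everything that belongs to the class), which is why the two can differ noticeably for the truthful class.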

# 7 Conclusions

In this project, we have addressed the problem of detecting deception from audio samples. Using the SVM model, we achieved an accuracy of 75% on the test dataset. While this is a promising baseline, there is still room for improvement:

1. The dataset is limited, which likely contributes to the model overfitting the training data. Collecting more labeled audio samples could improve model generalization.

2. Not all extracted features may be equally important for detecting deception. By performing feature selection, we can focus on features with strong correlations to the true or deceptive labels and eliminate noise.

3. While SVM performed well, exploring more advanced models like deep learning (e.g., CNNs or RNNs) or ensemble models (e.g., XGBoost, LightGBM) could improve accuracy by capturing more complex patterns in the data.

4. Our feature extraction was done without deep domain expertise. Collaborating with domain experts in psychology or linguistics could help identify new features strongly correlated with deception, further refining the model.

In conclusion, we have built a solid foundation for detecting deception using audio samples. By collecting more data, selecting the most relevant features, and exploring more advanced models with expert guidance, we can improve the accuracy and robustness of this approach.

# 8 References

1. Streeter, L. A., Krauss, R. M., Geller, V., Olson, C., & Apple, W. (1977). Pitch changes during attempted deception. Journal of Personality and Social Psychology, 35(5), 345–350. https://doi.org/10.1037/0022-3514.35.5.345

2. Rockwell, P., Buller, D. B., & Burgoon, J. K. (1997). The voice of deceit: Refining and expanding vocal cues to deception. Communication Research Reports, 14(4), 451–459.

3. Anolli, L., & Ciceri, R. (1997). The voice of deception: Vocal strategies of naive and able liars. Journal of Nonverbal Behavior, 21(4), 259–284.

4. Levitan, S. I., Maredia, A., & Hirschberg, J. (2018). Acoustic-prosodic indicators of deception and trust in interview dialogues. Proceedings of Interspeech 2018, 416–420. https://doi.org/10.21437/Interspeech.2018-1214

5. Fan, C., Zhang, X., Wang, Y., & Wang, Y. (2015). Deceptive speech detection based on sparse representation. 2015 IEEE International Conference on Signal Processing, Communications and Computing (ICSPCC), 1–5. https://doi.org/10.1109/ICSPCC.2015.7515793