# ECS7020P mini-project submission


## What is the problem?

This year's mini-project considers the problem of predicting whether a narrated story is true or not. Specifically, you will build a machine learning model that takes as an input an audio recording of **30 seconds** of duration and predicts whether the story being narrated is **true or not**. 


## Which dataset will I use?

A total of 100 samples consisting of a complete audio recording, a *Language* attribute and a *Story Type* attribute have been made available for you to build your machine learning model. The audio recordings can be downloaded from:

https://github.com/MLEndDatasets/Deception/tree/main/MLEndDD_stories_small

A CSV file recording the *Language* attribute and *Story Type* of each audio file can be downloaded from:

https://github.com/MLEndDatasets/Deception/blob/main/MLEndDD_story_attributes_small.csv




## What will I submit?

Your submission will consist of **one single Jupyter notebook** that should include:

*   **Text cells**, describing in your own words, rigorously and concisely your approach, each implemented step and the results that you obtain,
*   **Code cells**, implementing each step,
*   **Output cells**, i.e. the output from each code cell,

Your notebook **should have the structure** outlined below. Please make sure that you **run all the cells** and that the **output cells are saved** before submission. 

Please save your notebook as:

* ECS7020P_miniproject_2425.ipynb


## How will my submission be evaluated?

This submission is worth 16 marks. We will value:

*   Conciseness in your writing.
*   Correctness in your methodology.
*   Correctness in your analysis and conclusions.
*   Completeness.
*   Originality and efforts to try something new.

**The final performance of your solutions will not influence your grade**. We will grade your understanding. If you have a good understanding, you will be using the right methodology, selecting the right approaches, assessing correctly the quality of your solutions, sometimes acknowledging that despite your attempts your solutions are not good enough, and critically reflecting on your work to suggest what you could have done differently. 

Note that **the problem that we are intending to solve is very difficult**. Do not despair if you do not get good results, **difficulty is precisely what makes it interesting** and **worth trying**. 

## Show the world what you can do 

Why don't you use **GitHub** to manage your project? GitHub can be used as a presentation card that showcases what you have done and gives evidence of your data science skills, knowledge and experience. **Potential employers are always looking for this kind of evidence**. 





-------------------------------------- PLEASE USE THE STRUCTURE BELOW THIS LINE --------------------------------------------

# Exploring Audio Features for Deceptive Storytelling Detection Using Machine Learning

# 1 Author

**Student Name**:  Erica Low Ee Zhin

**Student ID**: 231126519



# 2 Problem Formulation 

Describe the machine learning problem that you want to solve and explain what's interesting about it.

Detecting deception is a vital yet complex challenge, with applications ranging from law enforcement and police investigation to private consulting and psychological research [1]. Traditional lie detection methods, such as polygraph tests, have significant limitations because they rely on physiological measurements such as heartbeat, blood pressure, respiration and skin temperature [1][2]. This approach often fails to capture the nuanced nature of human deception, since a person can suppress physical signs of discomfort, leading to false conclusions. While previous research has extensively explored visual and behavioral cues, such as eye blink analysis, pupil dilation and facial expression analysis [2], there is limited work on predicting deception purely from speech data. 

This project aims to fill this gap by developing a machine learning model that determines whether a narrated story is truthful or deceptive based solely on audio features. Raw audio signals are high-dimensional, which makes direct analysis difficult: a single 30-second chunk sampled at 44.1 kHz contains 1,323,000 samples, so the raw predictor space would have more than a million dimensions. To address this, we extract four audio features (Power, Pitch Mean, Pitch Standard Deviation and Voiced Fraction) from 30-second segments of the recordings. These features provide a manageable and informative predictor space for the model. Unlike polygraph tests, this non-invasive approach requires only a short audio sample and can be performed without specialized expertise from polygraphists or psychologists. It also removes the reliance on physiological measures, providing a portable, more accessible tool for deception detection that individuals cannot easily manipulate. 

The problem of deceptive storytelling detection is framed as a binary classification task with the labels `true_story` and `deceptive_story`. However, the inherent variability in audio data, including differences in languages, accents and speaking styles, makes this task particularly complex. To address these challenges, this project uses ensemble classification, combining multiple classifiers with the aim of improving the final predictive performance and creating a reliable machine learning pipeline for audio-based deception detection. 

# 3 Methodology

Describe your methodology. Specifically, describe your training task and validation task, and how model performance is defined (i.e. accuracy, confusion matrix, etc). Any other tasks that might help you build your model should also be described here.

#### **3.1 Data Preprocessing**
- All 100 audio recordings were divided into several 30-second chunks based on the original audio file length to standardize input lengths for the machine learning models. Any chunks shorter than 30 seconds were considered invalid and discarded.
- Each chunk was assigned a unique identifier (e.g., 00001.wav_chunk1), and chunks from the same recording were given the same File ID.
- The sampling rate of each chunk was identified and recorded to ensure compatibility across the dataset.


#### **3.2 Feature Extraction and Labels** 
- From each 30-second chunk, the following four audio features were extracted: 
  - **Power** : Represents the loudness of the audio.
  - **Pitch Mean** : Measures the average highness or lowness of the sound.
  - **Pitch Standard Deviation** : Measures the pitch variability, distinguishing steady tones from shaky tones.
  - **Voiced Fraction** : Represents the proportion of time the audio contains voiced sounds.
- The extracted features were combined into a single predictor matrix, where each row corresponds to a chunk and each column represents a feature.
- The labels for the 30-second chunks were binary, where `0` represents `true_story` and `1` represents `deceptive_story`. Class distribution was carefully monitored to maintain balance during preprocessing and splitting.
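As a rough illustration of how the four features could be summarised per chunk, the sketch below assumes that a pitch track `f0` has already been estimated for the chunk, with `NaN` marking unvoiced frames (the convention used by pitch trackers such as `librosa.pyin`). The function name `summarise_chunk` and the definition of power as mean squared amplitude are assumptions made for illustration, not necessarily the exact definitions used in this notebook.

```python
import numpy as np

def summarise_chunk(samples, f0):
    """Return (power, pitch_mean, pitch_std, voiced_fraction) for one chunk.

    samples: 1-D sequence of audio samples.
    f0: 1-D sequence of per-frame pitch estimates in Hz, NaN where unvoiced.
    """
    samples = np.asarray(samples, dtype=float)
    f0 = np.asarray(f0, dtype=float)

    power = float(np.mean(samples ** 2))      # mean squared amplitude (assumed definition)
    voiced = ~np.isnan(f0)                    # True on voiced frames
    voiced_fraction = float(np.mean(voiced))  # share of frames that are voiced

    if voiced.any():
        pitch_mean = float(np.mean(f0[voiced]))
        pitch_std = float(np.std(f0[voiced]))
    else:
        # No voiced frames: fall back to zeros so the feature row stays numeric.
        pitch_mean, pitch_std = 0.0, 0.0

    return power, pitch_mean, pitch_std, voiced_fraction

# Toy example: a constant-magnitude signal and a four-frame pitch track.
features = summarise_chunk([0.5, -0.5, 0.5, -0.5], [100.0, np.nan, 120.0, np.nan])
print(features)
```

Each call yields one row of the predictor matrix; stacking the rows for all valid chunks gives a matrix with four columns.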


#### **3.3 Training Task**
- Before being passed to the classifiers, all features were standardized using `StandardScaler` so that they share a consistent scale across models.
- Using `StratifiedGroupKFold`, the dataset was split into a training set (70%) and a validation set (30%). This keeps the class distribution balanced in both sets and prevents data leakage, because chunks from the same audio file never appear in both the training and validation sets.
- The training set consisted of 276 chunks and the validation set contained 144 chunks. Both sets had a similar distribution of labels, close to the overall chunk-level split of approximately 52% `true_story` and 48% `deceptive_story`.

#### **3.4 Validation Task**
- The validation task aimed to assess model performance on unseen data while maintaining class balance. After training, each model was evaluated on the validation set using the following metrics:
  - **Accuracy**: The proportion of correct predictions, which gives a general sense of overall model performance.
  - **F1 Score**: The harmonic mean of precision and recall. This metric was prioritized because both false positives and false negatives carry significant implications in this task, and because it remains informative under class imbalance.
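To make the two metrics concrete, here is a minimal sketch that computes accuracy and the binary F1 score directly from predictions, treating `deceptive_story` (label 1) as the positive class. The helper name `accuracy_and_f1` and the toy labels are illustrative assumptions; the notebook may use a different F1 averaging mode.

```python
def accuracy_and_f1(y_true, y_pred, positive=1):
    """Compute accuracy and binary F1 for the given positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # true positives
    fp = sum(t != positive and p == positive for t, p in pairs)  # false positives
    fn = sum(t == positive and p != positive for t, p in pairs)  # false negatives

    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return accuracy, f1

# Toy labels: 1 = deceptive_story, 0 = true_story.
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 1]
acc, f1 = accuracy_and_f1(y_true, y_pred)
print(f"accuracy={acc:.2f}, F1={f1:.2f}")
```

In this toy case the accuracy is 0.50 while the F1 score is 0.40, because the model misses two of the three deceptive chunks; this is the kind of difference that accuracy alone can hide.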
  
#### **3.5 Model Training**
- A total of four classifiers were trained and evaluated: Logistic Regression, Random Forest, Gradient Boosting and Support Vector Machine (SVM).

- Each model's performance was assessed using accuracy and F1 score on both the training and validation sets.

- A soft-voting ensemble classifier was created by combining the two best-performing models (based on validation F1 scores). The ensemble model was evaluated using validation accuracy and validation F1 score.

# 4 Implemented ML prediction pipelines

Describe the ML prediction pipelines that you will explore. Clearly identify their input and output, stages and format of the intermediate data structures moving from one stage to the next. It's up to you to decide which stages to include in your pipeline. After providing an overview, describe in more detail each one of the stages that you have included in their corresponding subsections (i.e. 4.1 Transformation stage, 4.2 Model stage, 4.3 Ensemble stage).

This section describes the overview of the machine learning (ML) prediction pipelines implemented for the audio-based deception detection. These pipelines consist of multiple stages, including: 
- 4.1 Feature Transformation
- 4.2 Model Training
- 4.3 Ensemble Learning

#### **Pipeline Overview**
**1. Input:** 

The input consists of 100 raw audio recordings, each split into 30-second chunks. Each chunk is labeled as either true_story (0) or deceptive_story (1). These audio chunks serve as the primary input, providing the basis for subsequent feature extraction and model training. The data format at this stage includes:
- Unique identifiers for each chunk (e.g., FileID_ChunkID).
- Labels (true_story or deceptive_story) to guide supervised learning.

**2. Transformation Stage:** 

In this stage, the raw audio data is transformed into a structured feature matrix that can be used for machine learning. Key steps include:
- Feature Extraction: Four audio features (power, pitch mean, pitch standard deviation and voiced fraction) were extracted from each chunk. These features are selected for their relevance to detecting deceptive behavior.

- Feature Standardization: Using a StandardScaler to normalize feature values, ensuring consistent scales across the dataset.

The output of this stage is a feature matrix, where rows represent individual audio chunks and columns correspond to the extracted features, file ID and labels.

**3. Model Stage:**

Next, the feature matrix was split into a training set (70%) and a validation set (30%). Several ML classifiers were then trained on the standardized features to learn the patterns associated with truthful and deceptive stories: Logistic Regression, Random Forest, Gradient Boosting and SVM. Each model was evaluated using accuracy and F1 score on both the training and validation sets to assess its predictive performance and generalizability.

The output of this stage is a dataframe that summarises each model's accuracy and F1 score on both the training and validation sets.

**4. Ensemble Stage:**

In this stage, the two best-performing models (based on validation F1 scores) are combined into an ensemble using a soft-voting approach. The ensemble is then trained and evaluated on the same training and validation sets as the individual models. 

The output of this stage is the validation accuracy and F1 score of the ensemble model. These results are compared with the individual model performance to check whether ensembling improves deceptive storytelling detection. 


## **4.1 Transformation Stage**

Describe any transformations, such as feature extraction. Identify input and output. Explain why you have chosen this transformation stage.

This section outlines the transformation stage, detailing the input and output, and explaining the rationale behind the chosen preprocessing and feature extraction methods.

### **4.1.1 Input:**
- 100 raw audio files labeled with their corresponding File ID, Language, and Story Type (true_story or deceptive_story), stored in the `MLEND_df` dataframe.

### **4.1.2 Audio Splitting:**
- Each raw audio file is divided into 30-second chunks to standardize input length for feature extraction and model training.
- Chunks shorter than 30 seconds are discarded to ensure consistency in input duration and to prevent bias caused by shorter chunks having different statistical properties.
- Metadata such as total duration, sample rate, total number of chunks and number of valid chunks is recorded.
- Intermediate output: A `metadata_df` dataframe that summarizes the raw audio statistics for each file, including File ID, Total Duration (s), Original Sample Rate, Total Chunks, Valid Chunks (30s), and Story Type. Another dataframe, `chunks_df`, contains only the valid 30s chunks, with columns for File ID, Chunk ID, Chunk Data, Story Type and Sample Rate.
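The splitting rule above can be sketched as a small helper. The function name `split_into_chunks` and the toy sample rate of 100 Hz (chosen only to keep the array small) are assumptions for illustration.

```python
import numpy as np

def split_into_chunks(audio, sample_rate, chunk_seconds=30):
    """Split a 1-D signal into consecutive chunks and keep the full-length ones."""
    chunk_size = int(chunk_seconds * sample_rate)  # chunk length in samples
    chunks = [audio[i:i + chunk_size] for i in range(0, len(audio), chunk_size)]
    valid = [chunk for chunk in chunks if len(chunk) == chunk_size]
    return chunks, valid

# A 122-second silent signal at a toy sample rate of 100 Hz.
sample_rate = 100
audio = np.zeros(122 * sample_rate)
chunks, valid = split_into_chunks(audio, sample_rate)
print(f"total chunks: {len(chunks)}, valid chunks: {len(valid)}")
```

A recording of about 122 seconds therefore produces five chunks, of which the trailing two-second remainder is discarded, leaving four valid chunks.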

### **4.1.3 Features Extraction and Labelling:**
- In this step, each valid 30s chunk in `chunks_df` undergoes feature extraction to derive four key audio features that capture audio patterns relevant to deceptive storytelling detection: 
  - **Power**: Measures the energy of the audio signal to capture intensity patterns.
  - **Pitch Mean and Standard Deviation**: Statistical measures of pitch to capture frequency variations relevant to storytelling dynamics.
  - **Voiced Fraction**: Proportion of time the audio contains voiced sounds, which is important for speech analysis.
- The story types (true_story and deceptive_story) are encoded as labels 0 and 1, respectively, for each valid chunk.

### **4.1.4 Output:**
- A feature matrix saved as `extracted_features.csv`, where each row represents a valid chunk, with columns for the four features, File ID, and the corresponding label.
- Storing the features as a CSV file reduces processing time by avoiding redundant feature extraction across multiple runs.

## **4.2 Model Stage**
Describe the ML model(s) that you will build. Explain why you have chosen them.

Four machine learning models were selected for training and evaluation, each offering unique strengths and capabilities to ensure a generalised and robust analysis:

### **4.2.1 Machine Learning Models**:
- **Logistic Regression**: 
    - A simple linear model that serves as a baseline to assess whether the audio dataset is linearly separable. 
    - It is easily interpretable and quick to implement, which facilitates direct comparison with more complex models.
- **Random Forest**: 
    - A non-linear ensemble model based on decision trees, which is able to detect non-linear relationships.
    - The parameters were set to `max_depth`=3 and `n_estimators`=10 to limit overfitting, so that the model does not memorise the training data while retaining its predictive power. 
- **Gradient Boosting**: 
    - An iterative model that builds a strong classifier by combining weak learners, improving performance over successive iterations. It can also cope with the mild class imbalance in our chunk-level data (about 52% true stories and 48% deceptive stories).
    - The parameters were set to `max_depth`=3 and `n_estimators`=30 to limit overfitting. 
- **Support Vector Machine (SVM)**: 
    - A powerful classifier that performs well on smaller datasets and higher-dimensional feature spaces, which suits our dataset (420 audio chunks derived from high-dimensional audio signals).
    - The model uses an "rbf" kernel with `C`=3 and `gamma`="scale" to construct non-linear boundaries, which improves its ability to separate the two classes.
    - The "rbf" kernel is used because the extracted features (power, pitch mean/std, voiced fraction) are likely to exhibit non-linear relationships that distinguish true from deceptive storytelling: both truthful and deceptive speech might involve complex combinations of pitch variation and energy that a linear decision boundary cannot separate effectively.
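The four models and the hyperparameters stated above could be instantiated in scikit-learn roughly as follows. The `random_state` values and `probability=True` on the SVM are assumptions added for reproducibility and to make class probabilities available; any parameter not mentioned in the text is left at the library default.

```python
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

models = {
    "Logistic Regression": LogisticRegression(),
    "Random Forest": RandomForestClassifier(
        max_depth=3, n_estimators=10, random_state=0),
    "Gradient Boosting": GradientBoostingClassifier(
        max_depth=3, n_estimators=30, random_state=0),
    # probability=True is an assumption: it enables predict_proba,
    # which soft voting needs.
    "SVM": SVC(kernel="rbf", C=3, gamma="scale", probability=True, random_state=0),
}

for name, model in models.items():
    print(f"{name}: {type(model).__name__}")
```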

## **4.3 Ensemble Stage**
Describe any ensemble approach you might have included. Explain why you have chosen them.

This section introduces the ensemble approach utilized in this project, explaining its methodology and the rationale for selecting it to enhance overall model predictive performance.

### **4.3.1 Soft Voting Ensemble:**
- The two best-performing models (based on validation F1 score) were selected. With soft voting, the class probabilities predicted by the selected models are averaged, and the class with the highest averaged probability becomes the final prediction. 

### **4.3.2 Reasons of using this approach:**
- Unlike hard voting, a soft-voting ensemble takes each model's confidence into account, balancing the strengths of the individual models. This gives a more robust decision-making process, which is useful for deceptive storytelling detection because it involves complex relationships among the audio features.
- By averaging the probabilities, the ensemble is less sensitive to overconfident predictions from a single model. 
- Soft voting combines different perspectives, which can lead to more reliable final predictions.
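The averaging step of soft voting can be sketched in a few lines. The probabilities below are made-up values for two hypothetical models, and the helper name `soft_vote` is an assumption for illustration; in scikit-learn the same idea is provided by `VotingClassifier(voting="soft")`.

```python
def soft_vote(prob_a, prob_b, threshold=0.5):
    """Average two models' P(deceptive_story) per chunk and threshold the mean."""
    averaged = [(a + b) / 2 for a, b in zip(prob_a, prob_b)]
    labels = [int(p >= threshold) for p in averaged]
    return averaged, labels

# Hypothetical P(deceptive_story) for three chunks from two models.
model_a_probs = [0.90, 0.30, 0.55]
model_b_probs = [0.60, 0.40, 0.35]
averaged, labels = soft_vote(model_a_probs, model_b_probs)
print(averaged, labels)
```

For the third chunk, the first model alone would predict `deceptive_story` (0.55), but the second model's stronger opposite opinion (0.35) pulls the average below 0.5, so the ensemble predicts `true_story`.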

# 5 Dataset

Describe the datasets that you will create to build and evaluate your models. Your datasets need to be based on our MLEnd Deception Dataset. After describing the datasets, build them here. You can explore and visualise the datasets here as well. 

If you are building separate training and validation datasets, do it here. Explain clearly how you are building such datasets, how you are ensuring that they serve their purpose (i.e. they are independent and consist of ID samples) and any limitations you might think of. It is always important to identify any limitations as early as possible. The scope and validity of your conclusions will depend on your ability to understand the limitations of your approach.

If you are exploring different datasets, create different subsections for each dataset and give them a name (e.g. 5.1 Dataset A, 5.2 Dataset B, 5.3 Dataset 5.3) .



- 100 audio files of varying length, typically around two minutes or longer.
- 50 true stories and 50 deceptive stories.
- Recordings are in 16 languages: English (78 recordings) and 15 other languages (22 recordings in total): Hindi, Bengali, Kannada, French, Arabic, Russian, "Chinese, Mandarin", Marathi, Portuguese, Spanish, Swahilli, Telugu, Korean, Cantonese and Italian.

- All recordings were split into 30-second chunks, as required by the problem statement.

The datasets for this project are based on the MLEnd Deception Dataset, which can be installed in the following four steps.

In [6]:
##Step 1: Install the Required Library##
# pip install mlend==1.0.0.4

##Step 2: Import Library and Functions##
# import mlend
# from mlend import download_deception_small, deception_small_load

##Step 3: Download small data##
# datadir = download_deception_small(save_to='MLEnd', subset={}, verbose=1, overwrite=False)

##Step 4: Read file paths##
# base_path = './MLEnd/deception/MLEndDD_stories_small/'
# MLEND_df = pd.read_csv('./MLEnd/deception/MLEndDD_story_attributes_small.csv').set_index('filename')
# files = [base_path + file for file in MLEND_df.index]

The MLEnd Deception Dataset comprises 100 raw audio recordings of narrated stories, varying in language and duration, with each recording labeled as either a `true_story` or `deceptive_story`.

In [9]:
#Data Loading 
import numpy as np
import pandas as pd
from tqdm import tqdm

base_path = './MLEnd/deception/MLEndDD_stories_small/'
MLEND_df = pd.read_csv('./MLEnd/deception/MLEndDD_story_attributes_small.csv').set_index('filename')
files = [base_path + file for file in MLEND_df.index]

display(MLEND_df.head())
print(f"We have a total of {len(files)} audio files in the dataset.")


#Language Distribution
language_counts = MLEND_df['Language'].value_counts()
language_df = pd.DataFrame(language_counts).transpose()
language_df['Sum'] = language_counts.sum()
print("\nLanguages narrated in the dataset are:")
display(language_df)

#Data Distribution
story_type_counts = MLEND_df['Story_type'].value_counts()
print(story_type_counts)

| filename  | Language | Story_type      |
|-----------|----------|-----------------|
| 00001.wav | Hindi    | deceptive_story |
| 00002.wav | English  | true_story      |
| 00003.wav | English  | deceptive_story |
| 00004.wav | Bengali  | deceptive_story |
| 00005.wav | English  | deceptive_story |


We have a total of 100 audio files in the dataset.

Languages narrated in the dataset are:


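| Language | English | Hindi | Arabic | Chinese, Mandarin | Marathi | Bengali | Kannada | French | Russian | Portuguese | Spanish | Swahilli | Telugu | Korean | Cantonese | Italian | Sum |
|----------|---------|-------|--------|-------------------|---------|---------|---------|--------|---------|------------|---------|----------|--------|--------|-----------|---------|-----|
| count    | 78      | 4     | 3      | 2                 | 2       | 1       | 1       | 1      | 1       | 1          | 1       | 1        | 1      | 1      | 1         | 1       | 100 |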


Story_type
deceptive_story    50
true_story         50
Name: count, dtype: int64


According to the results above, we have a balanced dataset of 50 true stories and 50 deceptive stories. The 100 recordings were narrated in a total of 16 different languages, as shown in `language_df` above. English is the dominant language with 78 recordings, followed by the other languages. 

Next, we create two dataframes: `metadata_df`, which summarises each recording, and `chunks_df`, which holds the valid 30-second chunks.

In [10]:
import librosa
import pandas as pd
from tqdm import tqdm

metadata_rows = []
chunk_rows = []

for file_id in tqdm(MLEND_df.index):
    file_path = base_path + file_id
    story_type = MLEND_df.loc[file_id, 'Story_type']

    audio_data, original_sr = librosa.load(file_path, sr=None)

    # Total duration and chunk size
    total_duration = len(audio_data) / original_sr
    chunk_size = int(30 * original_sr)  # 30 seconds in samples

    # Split the audio into chunks
    chunks = [audio_data[i:i + chunk_size] for i in range(0, len(audio_data), chunk_size)]
    valid_chunks = [chunk for chunk in chunks if len(chunk) == chunk_size]

    # Metadata
    metadata_rows.append({"File ID": file_id,
                          "Total Duration (s)": total_duration,
                          "Total Chunks": len(chunks),
                          "Valid Chunks": len(valid_chunks),
                          "Story Type": story_type,
                          "Original Sample Rate": original_sr})

    # Valid Chunks
    for i, chunk in enumerate(valid_chunks):
        chunk_rows.append({"File ID": file_id,
                           "Chunk ID": f"{file_id}_chunk{i + 1}",
                           "Chunk Data": chunk,
                           "Story Type": story_type,
                           "Sample Rate": original_sr})

# Convert to DataFrames
metadata_df = pd.DataFrame(metadata_rows)
chunks_df = pd.DataFrame(chunk_rows)

# Display Results
print("Summary of Audio Files:")
display(metadata_df)
print("\nSummary of Valid Audio Chunks:")
display(chunks_df)

# Summary Statistics
print("\nSummary Statistics:")
print(f"Total Files Processed: {len(metadata_df)}")
print(f"Total Chunks Created: {metadata_df['Total Chunks'].sum()}")
print(f"Total Valid Chunks (30s): {metadata_df['Valid Chunks'].sum()}")
print(f"Unique Sample Rates: {metadata_df['Original Sample Rate'].unique()}")

# True and Deceptive Story Distribution
valid_chunk_labels = chunks_df['Story Type'].value_counts()
print("\nCount of True and Deceptive Stories from Valid Chunks (30s):")
for story_type, count in valid_chunk_labels.items():
    print(f"{story_type}: {count} chunks")




100%|██████████| 100/100 [00:05<00:00, 16.96it/s]

Summary of Audio Files:





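|     | File ID   | Total Duration (s) | Total Chunks | Valid Chunks | Story Type      | Original Sample Rate |
|-----|-----------|--------------------|--------------|--------------|-----------------|----------------------|
| 0   | 00001.wav | 122.167256         | 5            | 4            | deceptive_story | 44100                |
| 1   | 00002.wav | 125.192018         | 5            | 4            | true_story      | 44100                |
| 2   | 00003.wav | 162.984127         | 6            | 5            | deceptive_story | 44100                |
| 3   | 00004.wav | 121.681270         | 5            | 4            | deceptive_story | 44100                |
| 4   | 00005.wav | 134.189751         | 5            | 4            | deceptive_story | 44100                |
| ... | ...       | ...                | ...          | ...          | ...             | ...                  |
| 95  | 00096.wav | 111.512063         | 4            | 3            | deceptive_story | 44100                |
| 96  | 00097.wav | 185.731224         | 7            | 6            | true_story      | 44100                |
| 97  | 00098.wav | 128.252766         | 5            | 4            | deceptive_story | 44100                |
| 98  | 00099.wav | 132.412562         | 5            | 4            | true_story      | 44100                |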



Summary of Valid Audio Chunks:


|     | File ID   | Chunk ID         | Chunk Data                                         | Story Type      | Sample Rate |
|-----|-----------|------------------|----------------------------------------------------|-----------------|-------------|
| 0   | 00001.wav | 00001.wav_chunk1 | [1.5258789e-05, 1.5258789e-05, 3.0517578e-05, ...  | deceptive_story | 44100       |
| 1   | 00001.wav | 00001.wav_chunk2 | [0.027450562, 0.026519775, 0.025390625, 0.0242...  | deceptive_story | 44100       |
| 2   | 00001.wav | 00001.wav_chunk3 | [-0.00091552734, -0.0011138916, -0.0013122559,...  | deceptive_story | 44100       |
| 3   | 00001.wav | 00001.wav_chunk4 | [6.1035156e-05, 9.1552734e-05, 7.6293945e-05, ...  | deceptive_story | 44100       |
| 4   | 00002.wav | 00002.wav_chunk1 | [0.0008239746, 0.0008239746, 0.00088500977, 0....  | true_story      | 44100       |
| ... | ...       | ...              | ...                                                | ...             | ...         |
| 415 | 00099.wav | 00099.wav_chunk4 | [-3.0517578e-05, -3.0517578e-05, -3.0517578e-0...  | true_story      | 44100       |
| 416 | 00100.wav | 00100.wav_chunk1 | [-0.00018310547, -0.00015258789, -6.1035156e-0...  | deceptive_story | 44100       |
| 417 | 00100.wav | 00100.wav_chunk2 | [0.0004272461, 0.00048828125, 0.0005187988, 0....  | deceptive_story | 44100       |
| 418 | 00100.wav | 00100.wav_chunk3 | [6.1035156e-05, 0.0, -6.1035156e-05, -6.103515...  | deceptive_story | 44100       |



Summary Statistics:
Total Files Processed: 100
Total Chunks Created: 520
Total Valid Chunks (30s): 420
Unique Sample Rates: [44100 48000]

Count of True and Deceptive Stories from Valid Chunks (30s):
true_story: 219 chunks
deceptive_story: 201 chunks



The datasets for this project are based on the MLEnd Deception Dataset, which consists of 100 audio recordings, each labeled as either `true_story` or `deceptive_story`. These datasets are created to train, validate, and evaluate machine learning models for deception detection. Below, we describe the datasets, the process of building them, the methods used to ensure independence and validity, and any limitations observed.

---

## **5.1 Dataset Creation**

### **5.1.1 Preprocessing and Chunking**
- **Original Data**:
  - The MLEnd Deception Dataset consists of 100 audio files, each representing a narrated story labeled as `true_story` (0) or `deceptive_story` (1).

- **Chunking**:
  - Each audio file was split into 30-second chunks. Chunks shorter than 30 seconds were discarded as non-valid chunks to maintain consistency and standardization.
  - Valid chunks were assigned a unique identifier, with the format `FileID_ChunkID` (e.g., `00001.wav_chunk1`).
  - All chunks from the same audio file retained the same `FileID`, ensuring the ability to track which chunks originated from the same source.

- **Chunk Statistics**:
  - A total of 520 chunks were created, of which 420 were valid (exactly 30 seconds long) and 100 were discarded as non-valid (the trailing partial chunk of each of the 100 files).

- **Metadata Recording**:
  - For each valid chunk, the following fields were recorded in `chunks_df`:
    - `File ID`: Original file identifier.
    - `Chunk ID`: Unique identifier for the chunk.
    - `Chunk Data`: The raw audio samples of the 30-second chunk.
    - `Story Type`: Label of the source recording (`true_story` or `deceptive_story`).
    - `Sample Rate`: Sampling rate of the source recording.
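As a quick sanity check on the chunk statistics, the short calculation below uses only the counts printed by the chunking cell above (520 chunks created, 420 valid, 219 `true_story` and 201 `deceptive_story` chunks). The variable names are illustrative.

```python
total_chunks = 520
valid_chunks = 420
n_files = 100
true_chunks, deceptive_chunks = 219, 201

discarded = total_chunks - valid_chunks          # trailing partial chunks
true_share = true_chunks / valid_chunks          # chunk-level class shares
deceptive_share = deceptive_chunks / valid_chunks

print(f"discarded chunks: {discarded} ({discarded / n_files:.0f} per file on average)")
print(f"true_story: {true_share:.1%}, deceptive_story: {deceptive_share:.1%}")
```

Although the 100 recordings are split exactly 50/50, the chunk-level data is slightly skewed towards `true_story` (about 52% versus 48%), because true stories contribute slightly more valid chunks overall.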

---

### **5.1.2 Feature Extraction**
For each valid chunk, the following four audio features were extracted:
- **Power**: Measures the energy of the audio signal.
- **Pitch Mean**: Represents the average pitch of the audio.
- **Pitch Standard Deviation**: Captures variations in pitch.
- **Voiced Fraction**: Proportion of time the audio contains voiced sounds.

These features were combined into a feature matrix, where each row corresponds to a chunk and the columns represent the extracted features.

---

## **5.2 Training and Validation Dataset**

### **5.2.1 Data Splitting**
- **Purpose**:
  - The dataset was split into training (70%) and validation (30%) sets to build and evaluate models.
  - The split ensures independence by preventing data leakage (i.e., chunks from the same file do not appear in both sets).

- **Method**:
  - Used `StratifiedGroupKFold` to split the data while:
    - Maintaining balanced class distributions in both sets.
    - Grouping chunks by `FileID` to ensure all chunks from the same file were placed in either the training or validation set.

- **Result**:
  - Training Set:
    - Contains 276 valid chunks.
    - Near-balanced class distribution, close to the overall chunk-level split of ~52% `true_story` and ~48% `deceptive_story`.
  - Validation Set:
    - Contains 144 valid chunks.
    - Near-balanced class distribution, close to the overall chunk-level split of ~52% `true_story` and ~48% `deceptive_story`.

---

### **5.2.2 Visualization**
- **Exploring Class Distribution**:
  - Visualized the class distributions in both training and validation sets to confirm balance.
- **Feature Analysis**:
  - Explored the distribution of each feature (e.g., Power, Pitch Mean, etc.) using histograms and box plots to identify patterns and outliers.

---

## **5.3 Limitations**
1. **Dataset Size**:
   - The dataset is relatively small, with only 420 valid chunks, which may limit the generalizability of the models.

2. **Class Imbalance**:
   - While the class distribution is approximately balanced, slight variations may still affect the model's sensitivity to one class over the other.

3. **Language and Accent Variability**:
   - Differences in language, accents, and speaking styles across the dataset may introduce variability that the features cannot fully capture.

4. **Limited Features**:
   - Only four audio features are used, which may not capture all the nuances needed for deception detection. Future work could explore more sophisticated feature extraction techniques.

5. **Chunk Independence**:
   - While grouping chunks by `FileID` ensures no data leakage, chunks within the same file may still exhibit similarities that could bias the model if overrepresented in either set.

---

## **5.4 Future Improvements**
- **Augmenting Data**:
  - Apply data augmentation techniques to increase the diversity and size of the dataset.
- **Additional Features**:
  - Explore more advanced audio features or deep learning-based embeddings for richer representations.
- **Cross-Validation**:
  - Employ cross-validation techniques to better evaluate model performance and reduce the impact of dataset variability.

This methodology ensures the creation of independent, balanced, and meaningful datasets for building and evaluating the machine learning models while acknowledging limitations and areas for improvement.


# 6 Experiments and results

Carry out your experiments here. Analyse and explain your results. Unexplained results are worthless.


This section documents the experiments conducted to evaluate the performance of various machine learning models for audio-based deception detection. The results are analyzed and explained in detail, highlighting key observations and potential areas for improvement.

---

## **6.1 Experiment Setup**

### **6.1.1 Models Tested**
The following models were trained and evaluated:
1. **Logistic Regression**: A baseline linear classifier.
2. **Random Forest**: An ensemble-based classifier capable of capturing non-linear relationships.
3. **Gradient Boosting**: A boosting method that iteratively refines weak learners.
4. **Support Vector Machine (SVM)**: A robust classifier for smaller datasets and high-dimensional spaces.

### **6.1.2 Evaluation Metrics**
- **Accuracy**: Measures the proportion of correct predictions.
- **F1 Score**: Balances precision and recall, particularly important for handling class imbalance.

### **6.1.3 Training and Validation Setup**
- **Data Split**: The dataset was divided into approximately 70% training and 30% validation using `StratifiedGroupKFold`, which keeps all chunks from one recording in the same partition while preserving the class ratio.
- **Feature Standardization**: All features were standardized using `StandardScaler` for consistent scaling.
- **Validation Task**: The validation set was used to estimate how well each model generalizes to unseen recordings.

---

## **6.2 Experiment Results**

### **6.2.1 Individual Model Performance**
| Model                 | Training Accuracy | Validation Accuracy | Training F1 Score | Validation F1 Score |
|-----------------------|-------------------|---------------------|-------------------|---------------------|
| Logistic Regression   | 56.88%           | 49.31%             | 49.79%           | 27.72%             |
| Random Forest         | 79.71%           | 53.47%             | 79.41%           | 43.70%             |
| Gradient Boosting     | 89.86%           | 53.47%             | 89.63%           | 46.40%             |
| SVM                   | 79.35%           | 50.00%             | 79.27%           | 44.62%             |

### **6.2.2 Ensemble Performance**
An ensemble model combining **Gradient Boosting** and **SVM** was evaluated using soft voting:
- **Validation Accuracy**: 52.08%
- **Validation F1 Score**: 43.90%
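For reference, a soft-voting ensemble of this kind can be sketched with scikit-learn's `VotingClassifier`. The data below are random and the hyperparameters are library defaults, both assumptions made for illustration; the notebook's actual configuration is not shown here. Soft voting averages the predicted class probabilities, so the SVM needs `probability=True`.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, VotingClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Synthetic stand-in data: 80 chunks, four features, balanced labels.
X = rng.normal(size=(80, 4))
y = np.tile([0, 1], 40)
X_train, y_train, X_val = X[:60], y[:60], X[60:]

ensemble = VotingClassifier(
    estimators=[
        ("gb", GradientBoostingClassifier(random_state=0)),
        # The SVM is scale-sensitive, so it gets its own scaler.
        ("svm", make_pipeline(StandardScaler(), SVC(probability=True, random_state=0))),
    ],
    voting="soft",  # average the predicted probabilities of both models
)
ensemble.fit(X_train, y_train)

proba = ensemble.predict_proba(X_val)  # averaged class probabilities
pred = ensemble.predict(X_val)         # class with the highest average
print(proba[:3])
print(pred[:3])
```

Because soft voting averages probabilities, the ensemble can only help when its members make different, partly complementary errors.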

---

## **6.3 Analysis of Results**

### **6.3.1 Observations**
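1. **Gradient Boosting Performance**:
   - Achieved the highest training accuracy (89.86%) and F1 score (89.63%), indicating strong learning on the training set.
   - However, its validation accuracy (53.47%) and F1 score (46.40%) are far lower, a clear sign of overfitting: the model does not generalize to unseen data.

2. **Random Forest Performance**:
   - Matched Gradient Boosting on validation accuracy (53.47%) with a smaller, but still large, gap to its training accuracy (79.71%), so it also overfits.

3. **SVM Performance**:
   - Dropped from 79.35% training accuracy to 50.00% validation accuracy, which is chance level for balanced classes, so it did not generalize either.

4. **Logistic Regression**:
   - Had the lowest scores among all models (49.31% validation accuracy, 27.72% validation F1) but also the smallest gap between training and validation accuracy. Its training accuracy of only 56.88% shows that a linear boundary on these four features captures little signal. However, since the non-linear models also score near chance on validation data, the results do not show that a non-linear boundary generalizes better.

5. **Ensemble Model**:
   - Combined Gradient Boosting and SVM, but its validation F1 score (43.90%) was lower than that of either component (46.40% and 44.62%), and its validation accuracy (52.08%) fell between theirs. Soft voting gave no improvement here.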
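The gap between training and validation scores is a quick check on these observations. The sketch below recomputes it from the figures reported in Section 6.2; only the variable names are new.

```python
# (train acc, val acc, train F1, val F1) in percent, copied from Section 6.2.1.
results = {
    "Logistic Regression": (56.88, 49.31, 49.79, 27.72),
    "Random Forest": (79.71, 53.47, 79.41, 43.70),
    "Gradient Boosting": (89.86, 53.47, 89.63, 46.40),
    "SVM": (79.35, 50.00, 79.27, 44.62),
}
ensemble_val_f1 = 43.90  # from Section 6.2.2

# Generalization gap: training accuracy minus validation accuracy.
acc_gaps = {name: round(r[0] - r[1], 2) for name, r in results.items()}
for name, gap in sorted(acc_gaps.items(), key=lambda item: item[1]):
    print(f"{name:<20} accuracy gap = {gap:5.2f} points")

# Individual models whose validation F1 exceeds the ensemble's.
better_than_ensemble = [n for n, r in results.items() if r[3] > ensemble_val_f1]
print("Validation F1 above the ensemble:", better_than_ensemble)
```

A large gap means the model fits the training chunks much better than unseen ones. The second check compares the ensemble's validation F1 against each individual model.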

---

### **6.3.2 Key Insights**
- **Feature Limitations**:
  - The four extracted features (Power, Pitch Mean, Pitch Standard Deviation, and Voiced Fraction) may not fully capture the complexity of deception in audio data, limiting the models' performance.
  
- **Overfitting in Complex Models**:
  - Gradient Boosting, Random Forest, and SVM all score far higher on training data than on validation data (accuracy gaps of roughly 26 to 36 percentage points), which indicates overfitting. More regularization or additional training data may help mitigate this.

- **Class Balance**:
  - Because the classes are balanced in both the training and validation sets, 50% accuracy corresponds to random guessing. Validation accuracies between 49.31% and 53.47% are therefore at or near chance level.

---

## **6.4 Limitations**
1. **Small Dataset**:
   - The relatively small size of the dataset (420 valid chunks) limits the models' ability to generalize, particularly for complex classifiers.

2. **Feature Representation**:
   - Using only four audio features may not adequately represent the intricacies of deception, leading to limited predictive power.

3. **Chunk Dependency**:
   - Although chunks from the same file were placed in either training or validation sets, their shared characteristics could still introduce subtle dependencies.
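The effect of the small dataset can be made concrete with a rough confidence interval for a validation accuracy. This is a sketch under stated assumptions: the validation set size `n_val` is taken as 30% of the 420 chunks, and the normal approximation treats chunks as independent, which the chunk-dependency limitation says they are not. The true uncertainty is therefore likely larger.

```python
import math


def accuracy_interval(acc, n, z=1.96):
    """Normal-approximation 95% interval for an accuracy measured on n samples."""
    se = math.sqrt(acc * (1.0 - acc) / n)  # binomial standard error
    return acc - z * se, acc + z * se


n_val = 126        # ASSUMPTION: 30% of the 420 valid chunks
best_acc = 0.5347  # best validation accuracy reported in Section 6.2.1

low, high = accuracy_interval(best_acc, n_val)
print(f"95% interval: [{low:.3f}, {high:.3f}]")
print("contains chance level (0.5):", low < 0.5 < high)
```

Under these assumptions the interval spans several points on either side of the measured accuracy, so differences of a few points between models are within the noise.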

---

## **6.5 Future Improvements**
1. **Feature Engineering**:
   - Extract additional features, such as MFCCs, spectral features, or embeddings from pre-trained audio models, to enrich the predictor space.

2. **Data Augmentation**:
   - Introduce synthetic audio variations (e.g., pitch shifts, time stretching) to increase dataset diversity and size.

3. **Advanced Models**:
   - Experiment with deep learning approaches, such as recurrent neural networks (RNNs) or transformers, to better capture temporal and contextual information in audio data.

4. **Cross-Validation**:
   - Use cross-validation to better evaluate model performance and minimize the impact of data splits on results.

---

## **6.6 Conclusion**
The experiments show how difficult deception detection from audio is with this dataset and feature set. No model, including the ensemble, achieved a validation accuracy meaningfully above the 50% chance level, which underlines the need for richer features, larger datasets, and more advanced modeling techniques. These insights will guide future iterations of the project.


# 7 Conclusions

Your conclusions, suggestions for improvements, etc should go here.

This project explored the use of machine learning models for audio-based deception detection. Using the MLEnd Deception Dataset, audio features were extracted, processed, and used to train multiple classification models. The results show that this is a challenging task and highlight key areas for improvement. Below are the main conclusions and suggestions for future work:

---

## **7.1 Conclusions**
1. **Model Performance**:
   - Gradient Boosting and Random Forest achieved the highest validation accuracy (53.47% each), with Gradient Boosting achieving the highest validation F1 score (46.40%). These values are close to chance level, so no model detected deception reliably.
   - Logistic Regression had the lowest scores, but because the non-linear models also performed near chance on validation data, the results do not establish whether non-linear decision boundaries help.
   - The ensemble model, combining Gradient Boosting and SVM, did not improve on its components (validation F1 score of 43.90%) and faced the same generalization challenges.

2. **Feature Representation**:
   - The four extracted features (Power, Pitch Mean, Pitch Standard Deviation, and Voiced Fraction) provided a good starting point for analysis. However, these features alone may not capture the full complexity of deception in audio data.

3. **Dataset Challenges**:
   - The small size of the dataset (420 valid chunks) and inherent variability in audio (e.g., accents, speaking styles) made it difficult for models to generalize effectively.
   - Splitting with `StratifiedGroupKFold` preserved the class balance and kept chunks from the same recording out of both sets at once, which guards the reported results against data leakage.

4. **Practicality of the Approach**:
   - If its accuracy can be improved, a non-invasive, audio-based deception detection method would offer advantages over traditional polygraph tests, such as accessibility, portability, and ease of use.

---

## **7.2 Suggestions for Improvements**
1. **Feature Engineering**:
   - Incorporate more advanced features, such as:
     - **Mel-Frequency Cepstral Coefficients (MFCCs)** for richer frequency representation.
     - **Spectral Features** to capture detailed audio dynamics.
     - **Embeddings** from pre-trained audio models (e.g., OpenL3, Wav2Vec) for deeper contextual understanding.
   - Explore temporal features or sequential patterns in the audio data using time-series analysis.

2. **Data Augmentation**:
   - Apply augmentation techniques, such as pitch shifts, time stretching, and noise injection, to increase dataset size and diversity.

3. **Advanced Modeling**:
   - Experiment with deep learning models like:
     - **Recurrent Neural Networks (RNNs)** to capture sequential dependencies.
     - **Convolutional Neural Networks (CNNs)** to analyze spectrograms or feature maps.
     - **Transformers** for contextual and temporal learning on audio sequences.

4. **Cross-Validation**:
   - Implement cross-validation across multiple folds to provide a more robust evaluation of model performance.

5. **Larger and More Diverse Datasets**:
   - Expand the dataset to include more recordings with a broader range of languages, accents, and speaking styles.
   - Incorporate external datasets, if available, to increase training data volume.

6. **Interpretability and Explainability**:
   - Investigate which features contribute most to model predictions using techniques like SHAP (SHapley Additive exPlanations) or feature importance analysis.
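As one concrete example of the augmentation suggestion, the sketch below injects Gaussian noise at a chosen signal-to-noise ratio using only NumPy. The sine-wave signal, the sampling rate, and the helper name `add_noise` are hypothetical; pitch shifting and time stretching would typically rely on an audio library and are not shown. Augmentation should be applied to training chunks only, after the train/validation split.

```python
import numpy as np


def add_noise(signal, snr_db, rng):
    """Return a copy of `signal` with white Gaussian noise at the given SNR (dB)."""
    signal_power = np.mean(signal ** 2)
    noise_power = signal_power / (10 ** (snr_db / 10))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=signal.shape)
    return signal + noise


# Hypothetical stand-in for one audio chunk: 1 second of a 220 Hz tone.
sr = 16000
t = np.arange(sr) / sr
clean = 0.5 * np.sin(2 * np.pi * 220 * t)

rng = np.random.default_rng(0)
noisy = add_noise(clean, snr_db=20, rng=rng)

# Check the SNR that was actually achieved.
measured_snr = 10 * np.log10(np.mean(clean ** 2) / np.mean((noisy - clean) ** 2))
print(f"measured SNR = {measured_snr:.2f} dB")
```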

---

## **7.3 Final Thoughts**
This project explored machine learning for audio-based deception detection and highlighted both opportunities and challenges. The initial results are close to chance level, so achieving robust and generalizable models will require significant improvements in dataset quality, feature engineering, and modeling techniques. By addressing these challenges, future iterations of this work could move toward practical and reliable tools for detecting deception in real-world applications.


# 8 References

Acknowledge others here (books, papers, repositories, libraries, tools) 

1. Alice Xue (2019). *An Acoustic Automated Lie Detector*.
2. *A Comprehensive Review on Features Extraction and Features Matching Techniques for Deception Detection*.

Deciding whether to keep or discard audio chunks shorter than 30 seconds depends on your project goals, data characteristics, and modeling approach. Here's a detailed look at the advantages and limitations of both approaches:

### **1. Discarding Chunks Shorter Than 30s**

**Advantages**:
- **Consistency**:
  - All chunks are of the same duration (30s), which simplifies feature extraction and model training.
  - Models trained on consistent chunk sizes avoid handling varying feature lengths (e.g., MFCC arrays).
- **Avoids Data Imbalance**:
  - Shorter chunks may have different statistical properties compared to 30s chunks, potentially introducing bias in model training.
- **Simplified Processing**:
  - Uniform chunk length makes batch processing easier and faster during training and inference.
- **Feature Reliability**:
  - Features like MFCCs, pitch, and RMS are more stable and meaningful for consistent input durations.

**Limitations**:
- **Data Loss**:
  - Discarding shorter chunks reduces the size of your dataset, which can be critical if you already have a small dataset.
  - Important patterns or information in those shorter chunks are lost.
- **Potential Bias**:
  - If certain types of audio (e.g., deceptive stories) are more likely to have shorter durations, discarding chunks can skew the dataset.

### **2. Keeping Chunks Shorter Than 30s**

**Advantages**:
- **Maximizes Data**:
  - Retains every available audio chunk, which can be critical for small datasets.
  - Helps increase the training sample size and improve model generalization.
- **Preserves Information**:
  - Retains all available information, especially if shorter chunks contain important features or patterns.

**Limitations**:
- **Feature Variability**:
  - Shorter chunks will have fewer data points, resulting in different feature lengths (e.g., fewer MFCC frames), which may require additional preprocessing (e.g., padding or truncation).
- **Impact on Features**:
  - Features like Zero-Crossing Rate (ZCR) and Energy (RMS) might be less meaningful for very short chunks.
  - Padding shorter chunks with zeros can distort features like ZCR.
- **Complexity**:
  - Models may need additional handling for variable input lengths, such as:
    - Padding with zeros.
    - Using dynamic architectures like recurrent neural networks (RNNs) or transformers.

### **Possible Strategies**

1. **If Consistency is Key (Discard Shorter Chunks)**:
   - Remove chunks shorter than 30 seconds for a consistent input size.
   - Works well for simpler models like logistic regression, SVMs, or decision trees.

2. **If Data is Limited (Keep Shorter Chunks)**:
   - Retain all chunks but:
     - **Pad with Zeros**: Extend shorter chunks to 30s by padding with zeros.
     - **Truncate to Consistent Size**: Extract the first few seconds of shorter chunks (e.g., first 5-10 seconds) for consistency.
   - Helps in small datasets where maximizing data size is critical.

### **Recommendation**

- If you have sufficient data (100 audio files, each with 30s chunks):
  - Discard shorter chunks for consistency, as this simplifies the processing pipeline and ensures robust feature extraction.
- If the dataset is small or imbalanced:
  - Keep shorter chunks to maximize data but preprocess (e.g., padding or truncating) to maintain consistency.
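The discard and zero-pad options discussed above can be sketched in a few lines of NumPy. The function name `split_into_chunks`, its `keep_short` parameter, and the toy signal are hypothetical; this is an illustration of the two strategies rather than the notebook's chunking code.

```python
import numpy as np


def split_into_chunks(signal, sr, chunk_seconds=30, keep_short="discard"):
    """Split a 1-D signal into fixed-length chunks.

    keep_short="discard" drops a final chunk shorter than chunk_seconds;
    keep_short="pad" extends it to full length with trailing zeros.
    """
    if keep_short not in ("discard", "pad"):
        raise ValueError("keep_short must be 'discard' or 'pad'")
    chunk_len = int(sr * chunk_seconds)
    chunks = []
    for start in range(0, len(signal), chunk_len):
        chunk = signal[start:start + chunk_len]
        if len(chunk) == chunk_len:
            chunks.append(chunk)
        elif keep_short == "pad":
            chunks.append(np.pad(chunk, (0, chunk_len - len(chunk))))
    return chunks


# Toy recording: 75 seconds at a deliberately low sampling rate of 100 Hz.
sr = 100
recording = np.ones(75 * sr)

discarded = split_into_chunks(recording, sr, keep_short="discard")
padded = split_into_chunks(recording, sr, keep_short="pad")
print(len(discarded), "chunks when discarding the short remainder")
print(len(padded), "chunks when zero-padding the short remainder")
```

With a 75-second recording, discarding loses the final 15 seconds, while padding keeps them at the cost of 15 seconds of artificial silence in the last chunk.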