# The Booker Data Audio Analysis

After reading the Booker Paper (i.e, "Narratives in the Immediate Aftermath of Traumatic Injury: Markers of Ongoing Depressive and Posttraumatic Stress Disorder Symptoms"), we noticed that it just focusses on the words spoken, and not the **way these words were spoken**.

In addiiton, reading other papers by prominent researchers such as Dr. David C Atkins, I also am curious regarding **measuring empathy of the interviewer** in the ED scenario. Empathy expressed by the therapist is also an indicator to recovery speed of the patient.

So here's what I propose. Let's use the Booker Paper Data, and apart from extracting the features they have mentioned in their paper, we can augment and create a more accurate model by including audio features from the patient audio, and empathy measurement on the interviewer speech. For Empathy calculation, we can follow something along the lines of what Dr. Atkins has demonstrated in his paper:  A Computational Approach to Understanding Empathy Expressed inText-Based Mental Health Support (https://arxiv.org/pdf/2009.08441.pdf)

So our model would consist of:
1. Features mentioned in the Booker Paper.
2. Audio Features extracted from patient audio.
3. Empathy calculation from therapist audio.

All these together could be a novel addition to the already existing work, and give our paper a more HCI-oriented direction.


## Notebook Structure

I have maintained the following structure throughout the notebook:
1. A brief description of that particular feature is presented.
2. Helper Functions are declared that can be eventually called for a particular audio file.
3. Helper Functions are then called for the ED audio file, and the subsequent follow-up sessions. The extracted values are printed, graphs are plotted to gelp us compare the result.
4. Observation, if any, are mentioned.


## Features Extracted

This notebook deals with "audio feature extraction" part of our research. The list of features Extracted are:  
1. Number of Onsets
2. Pitch Estimation
1. MFCC
2. Zero Crossings
3. Spectral Centroid
4. Spectrogram
5. Chroma FFT
6. Energy
7. RMS Energy
8. Spectral Rolloff
9. Phonation Rate
10. Speech Productivity
11. Speech Rate
12. Articulation Rate


# Starting off: Importing required modules and loading the three audio files

In [7]:
import librosa 
import librosa.display
from scipy.io import wavfile as wav
import speech_recognition as sr
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import kurtosis
import sklearn
from pydub import AudioSegment 
from pydub.silence import split_on_silence 
import ffmpeg
from sklearn.svm import SVC

# Audio File Recorded at ED (hereby referred to as ED Audio)
filename1 = 'patient_audio/session2/low_sud_audio.wav'
duration1 = librosa.get_duration(filename=filename1)

# Audio File Recorded at first Follow-Up Session (hereby referred to as FLW-1 Audio)
filename2 = 'patient_audio/session2/medium_sud_audio.wav'
duration2 = librosa.get_duration(filename=filename2)

# Audio File Recorded at second Follow-Up Session (hereby referred to as FLW-2 Audio)
filename3 = 'patient_audio/session2/high_sud_audio.wav'
duration3 = librosa.get_duration(filename=filename3)

'''
In order to compare the three audio files properly, we should be analyzing equal sizes of audio snippets.
Please ensure that all audio snippets are 50 seconds long each, covering the audio of the most interest. 
Otherwise the code below is gonna snip it anyways.
Thus, snipping duration of all three audio files to the minimum of the three files.
'''
audio_duration = min(duration1, duration2, duration3)

flw_2_audio, sample_rate = librosa.load(filename1, duration=audio_duration) 
flw_1_audio, sample_rate = librosa.load(filename2, duration=audio_duration) 
ed_audio, sample_rate = librosa.load(filename3, duration=audio_duration) 

print('Librosa sample rate:', sample_rate)

Librosa sample rate: 22050


# Feature Extraction: Onset Detection

Onset refers to the beginning of a musical note or other sound. It is related to (but different from) the concept of a transient: all musical notes have an onset, but do not necessarily include an initial transient. We are locating note onset events by picking peaks in an onset strength envelope here.

In [5]:
# Helper Functions
'''
Creating Helper Function that checks number of onsets. 
    Param1: The Audio
'''
hop_length = 512
def number_of_onsets(audio):
    onset_frames = librosa.onset.onset_detect(audio, sr=sample_rate, hop_length=hop_length, backtrack=True)
    onset_times = librosa.frames_to_time(onset_frames, sr=sample_rate, hop_length=hop_length)
    return(len(onset_times))

In [8]:
print("Number of Onsets for ED Audio: "+str(number_of_onsets(ed_audio)))
print("Number of Onsets for FLW-1 Audio: "+str(number_of_onsets(flw_1_audio)))
print("Number of Onsets for FLW-2 Audio: "+str(number_of_onsets(flw_2_audio)))

Number of Onsets for ED Audio: 319
Number of Onsets for FLW-1 Audio: 377
Number of Onsets for FLW-2 Audio: 387


## Observation for Onset Detections

Higher the SUD, lesser onsets were detected. Not sure about the reason for this.

# Feature Extraction: Pitch Estimation

The lowest frequency of any vibrating object is called the fundamental frequency. The fundamental frequency provides the sound with its strongest audible pitch reference - it is the predominant frequency in any complex waveform.

A sine wave is the simplest of all waveforms and contains only a single fundamental frequency and no harmonics, overtones or partials.

Virtually all musical sounds have waves that are infinitely more complex than a sine wave. It is the addition of harmonics and overtones to a wave that makes it possible to distinguish between different sounds and instruments; the timbre.

In [15]:
# Helper Functions
'''
Creating Helper Function that estimates pitch. 
    Param1: The Audio
'''
def find_pitch(audio):
    
    r = librosa.autocorrelate(audio, max_size=10000)
    midi_hi = 120.0
    midi_lo = 12.0
    f_hi = librosa.midi_to_hz(midi_hi)
    f_lo = librosa.midi_to_hz(midi_lo)
    t_lo = sample_rate/f_hi
    t_hi = sample_rate/f_lo
    
    r[:int(t_lo)] = 0
    r[int(t_hi):] = 0
    t_max = r.argmax()
    f0 = (sample_rate)/t_max
    return f0

In [17]:
print("Pitch for ED Audio: "+str(find_pitch(ed_audio))+" Hz")
print("Pitch for FLW-1 Audio: "+str(find_pitch(flw_1_audio))+" Hz")
print("Pitch for FLW-2 Audio: "+str(find_pitch(flw_2_audio))+" Hz")

Pitch for ED Audio: 11025.0 Hz
Pitch for FLW-1 Audio: 11025.0 Hz
Pitch for FLW-2 Audio: 11025.0 Hz


## Observation for Pitch Estimation

Pitch Estimates coming out to be exactly same for all audio files. Will need to test on other audio files to understand it's relevance.

# Feature Extraction: Spectral Bandwidth

Bandwidth is the difference between the upper and lower frequencies in a continuous band of frequencies. It is typically measured in hertz.

The below function computes bandwidth across many time frames.

In [18]:
# Helper Functions
'''
Creating Helper Function that calculates variance of order-p spectral bandwidth. 
    Param1: The Audio
    Param2: p
'''
def bandwidth_mean(audio, p):
    spectral_bandwidth = librosa.feature.spectral_bandwidth(audio+0.01, sr=sample_rate, p=p)[0]
    return (np.mean(spectral_bandwidth, dtype = np.float32))

def bandwidth_variance(audio, p):
    spectral_bandwidth = librosa.feature.spectral_bandwidth(audio+0.01, sr=sample_rate, p=p)[0]
    return (np.var(spectral_bandwidth, dtype = np.float32))

In [21]:
# Checking second order spectral bandwidths
print("2-order Spectral Bandwidth mean for ED Audio: "+str(bandwidth_mean(ed_audio, 2)))
print("2-order Spectral Bandwidth variance for ED Audio: "+str(bandwidth_variance(ed_audio, 2))+"\n\n")
print("2-order Spectral Bandwidth mean for FLW-1 Audio: "+str(bandwidth_mean(flw_1_audio, 2)))
print("2-order Spectral Bandwidth variance for FLW-1 Audio: "+str(bandwidth_variance(flw_1_audio, 2))+"\n\n")
print("2-order Spectral Bandwidth mean for FLW-2 Audio: "+str(bandwidth_mean(flw_2_audio, 2)))
print("2-order Spectral Bandwidth variance for FLW-2 Audio: "+str(bandwidth_variance(flw_2_audio, 2)))

2-order Spectral Bandwidth mean for ED Audio: 2286.667
2-order Spectral Bandwidth variance for ED Audio: 196597.48


2-order Spectral Bandwidth mean for FLW-1 Audio: 2191.14
2-order Spectral Bandwidth variance for FLW-1 Audio: 205327.84


2-order Spectral Bandwidth mean for FLW-2 Audio: 2201.5193
2-order Spectral Bandwidth variance for FLW-2 Audio: 240072.23


In [22]:
# Checking third order spectral bandwidths
print("3-order Spectral Bandwidth mean for ED Audio: "+str(bandwidth_mean(ed_audio, 3)))
print("3-order Spectral Bandwidth variance for ED Audio: "+str(bandwidth_variance(ed_audio, 3))+"\n\n")
print("3-order Spectral Bandwidth mean for FLW-1 Audio: "+str(bandwidth_mean(flw_1_audio, 3)))
print("3-order Spectral Bandwidth variance for FLW-1 Audio: "+str(bandwidth_variance(flw_1_audio, 3))+"\n\n")
print("3-order Spectral Bandwidth mean for FLW-2 Audio: "+str(bandwidth_mean(flw_2_audio, 3)))
print("3-order Spectral Bandwidth variance for FLW-2 Audio: "+str(bandwidth_variance(flw_2_audio, 3)))

3-order Spectral Bandwidth mean for ED Audio: 2884.796
3-order Spectral Bandwidth variance for ED Audio: 124205.945


3-order Spectral Bandwidth mean for FLW-1 Audio: 2797.852
3-order Spectral Bandwidth variance for FLW-1 Audio: 140705.19


3-order Spectral Bandwidth mean for FLW-2 Audio: 2784.9329
3-order Spectral Bandwidth variance for FLW-2 Audio: 172804.69


## Observation for Spectral Bandwidth

Higher the SUD, lower the bandwidth variance.

# Feature Extraction: Mel-Frequency Cepstral Coefficients (MFCC)

The MFCC is an immensely elaborate concept in itself, and thus might require extra reading to fully understand.
However, to put it in a nutshell, it can be best described as the **rate of change in spectral bands**, i.e, it's the rate of change in various frequencies that are present in the audio. For a detailed and mathematical understanding of MFCC, you check out https://medium.com/prathena/the-dummys-guide-to-mfcc-aceab2450fd


In [28]:
# Helper Functions
'''
Creating Helper Function that visualizes the MFCC Values. 
    Param1: The MFCC Array
    Param2: The coefficient you wish to visualize
'''

'''
Creating Helper Function that calculates the Average of MFCC Values. 
    Param1: The MFCC Array
'''
def mfcc_mean(mfcc_value, coefficient_number):
    return np.mean(mfcc_value[coefficient_number])


'''
Creating Helper Function that calculates the Variance of MFCC Values. 
    Param1: The MFCC Array
    Param2: The coefficient
'''
def mfcc_variance(mfcc_value, coefficient_number):
    return np.var(mfcc_value[coefficient_number], dtype = np.float32)


In [31]:
# Extracting MFCC Values
mfccs_ed_audio = librosa.feature.mfcc(y=ed_audio, sr=sample_rate, n_mfcc=20)
mfccs_flw_1_audio = librosa.feature.mfcc(y=flw_1_audio, sr=sample_rate, n_mfcc=20)
mfccs_flw_2_audio = librosa.feature.mfcc(y=flw_2_audio, sr=sample_rate, n_mfcc=20)

In [33]:
# For ED Audio
print("For ED Audio: ")
print("Mean MFCC Values for 2nd Coefficent: "+str(mfcc_mean(mfccs_ed_audio, 2)))
print("MFCC Values Variance for 2nd Coefficent: "+str(mfcc_variance(mfccs_ed_audio, 2))+"\n")

# For FLW-1 Audio
print("For FLW-1 Audio: ")
print("Mean MFCC Values for 2nd Coefficent: "+str(mfcc_mean(mfccs_flw_1_audio, 2)))
print("MFCC Values Variance for 2nd Coefficent: "+str(mfcc_variance(mfccs_flw_1_audio, 2))+"\n")

# For FLW-2 Audio
print("For FLW-2 Audio: ")
print("Mean MFCC Values for 2nd Coefficent: "+str(mfcc_mean(mfccs_flw_2_audio, 2)))
print("MFCC Values Variance for 2nd Coefficent: "+str(mfcc_variance(mfccs_flw_2_audio, 2)))

For ED Audio: 
Mean MFCC Values for 2nd Coefficent: -4.792346
MFCC Values Variance for 2nd Coefficent: 405.71127

For FLW-1 Audio: 
Mean MFCC Values for 2nd Coefficent: -3.5600028
MFCC Values Variance for 2nd Coefficent: 439.09656

For FLW-2 Audio: 
Mean MFCC Values for 2nd Coefficent: -6.3759904
MFCC Values Variance for 2nd Coefficent: 552.8272


## We can do similar analysis for all 20 MFCC coefficients

In the above example, we have passed coeeficient as 2. We can do the same analysis for coefficents 0-19. For the second coefficient, we see that variance increases as SUD decreases.

# Feature Extraction: Zero Crossings (ZC)

The zero crossing rate is the rate of sign-changes along a signal, i.e., the rate at which the signal changes from positive to negative or back. This feature has been used heavily in both speech recognition and music information retrieval. It usually has higher values for highly percussive sounds like those in metal and rock.


In [36]:
# Helper Functions
'''
Creating Helper Function that calculates the Mean of ZC Values. 
    Param1: The ZC Array
'''
def zc_mean(zero_crossings):
    return np.mean(zero_crossings, dtype = np.float32)
'''
Creating Helper Function that calculates the Variance of ZC Values. 
    Param1: The ZC Array
'''
def zc_variance(zero_crossings):
    return np.var(zero_crossings, dtype = np.float32)

In [38]:
# For ED Audio
zero_crossings_ed_audio = librosa.zero_crossings(ed_audio, pad=False)
print("Zero Crossings for ED Audio: "+str(sum(zero_crossings_ed_audio)))
print("Mean Zero Crossings for ED Audio: "+str(zc_mean(zero_crossings_ed_audio)))
print("Variance of Zero Crossings for ED Audio: "+str(zc_variance(zero_crossings_ed_audio))+"\n")

# For FLW-1 Audio
zero_crossings_flw_1_audio = librosa.zero_crossings(flw_1_audio, pad=False)
print("Zero Crossings for FLW-1 Audio: "+str(sum(zero_crossings_flw_1_audio)))
print("Mean Zero Crossings for FLW-1 Audio: "+str(zc_mean(zero_crossings_flw_1_audio)))
print("Variance of Zero Crossings for FLW-1 Audio: "+str(zc_variance(zero_crossings_flw_1_audio))+"\n")

# For FLW-2 Audio
zero_crossings_flw_2_audio = librosa.zero_crossings(flw_2_audio, pad=False)
print("Zero Crossings for FLW-2 Audio: "+str(sum(zero_crossings_flw_2_audio)))
print("Mean Zero Crossings for FLW-2 Audio: "+str(zc_mean(zero_crossings_flw_2_audio)))
print("Variance of Zero Crossings for FLW-2 Audio: "+str(zc_variance(zero_crossings_flw_2_audio)))

Zero Crossings for ED Audio: 289482
Mean Zero Crossings for ED Audio: 0.109551355
Variance of Zero Crossings for ED Audio: 0.09754987

Zero Crossings for FLW-1 Audio: 243075
Mean Zero Crossings for FLW-1 Audio: 0.09198912
Variance of Zero Crossings for FLW-1 Audio: 0.08352713

Zero Crossings for FLW-2 Audio: 271281
Mean Zero Crossings for FLW-2 Audio: 0.10266338
Variance of Zero Crossings for FLW-2 Audio: 0.092123665


## Conclusion for Zero Crossings (ZC):

Interesting stuff! ZC is a measure of noisiness of a signal. It's known to have HIGHER variance for music-like audio segments, and LOWER variance for speech-like audio segments. 

From our observation above, we can see that higher the SUD value, HIGHER is the variance, i.e, indicating a MORE NOISY audio as SUDs increase. I believe this is happening due to the patient breaking down and crying at higher SUDs, thereby leading to more noisiness.

# Feature Extraction: Spectral Centroid (SC)

The spectral centroid indicates at which frequency the energy of a spectrum is centered upon or in other words It indicates where the **center of mass** for a sound is located.

In [40]:
# Helper Functions
'''
Creating Helper Function that normalizes. 
'''

def normalize(x, axis=0):
    return sklearn.preprocessing.minmax_scale(x, axis=axis)

'''
Creating Helper Function that calculates mean of spectral centroids. 
'''
def centroid_mean(audio):
    spectral_centroids = librosa.feature.spectral_centroid(audio, sr=sample_rate)[0]
    return np.mean(spectral_centroids, dtype = np.float32)

'''
Creating Helper Function that calculates variance of spectral centroids. 
'''
def centroid_variance(audio):
    spectral_centroids = librosa.feature.spectral_centroid(audio, sr=sample_rate)[0]
    return np.var(spectral_centroids, dtype = np.float32)

In [42]:
# For ED Audio
print("Mean of Spectral Centroids for ED audio: "+str(centroid_mean(ed_audio)))
print("Variance of Spectral Centroids for ED audio: "+str(centroid_variance(ed_audio))+"\n")

# For FLW-1 Audio
print("Mean of Spectral Centroids for FLW-1 audio: "+str(centroid_mean(flw_1_audio)))
print("Variance of Spectral Centroids for FLW-1 audio: "+str(centroid_variance(flw_1_audio))+"\n")

# For FLW-2 Audio
print("Mean of Spectral Centroids for FLW-2 audio: "+str(centroid_mean(flw_2_audio)))
print("Variance of Spectral Centroids for FLW-2 audio: "+str(centroid_variance(flw_2_audio)))

Mean of Spectral Centroids for ED audio: 2249.0615
Variance of Spectral Centroids for ED audio: 728736.9

Mean of Spectral Centroids for FLW-1 audio: 2083.7686
Variance of Spectral Centroids for FLW-2 audio: 660445.3

Mean of Spectral Centroids for FLW-2 audio: 2168.2783
Variance of Spectral Centroids for FLW-2 audio: 830903.0


## Conclusion for Spectral Centroids (SC):

Spectral Centroid can be best thought of as the dominant frequency at a certain point of time. Now according to the Marmar paper, people with more monotonous speech (i.e, less variation in frequencies) are more prone to PTSD, but the results we got above are in direct contrast.

We see that, higher the SUD value, MORE is the variance in spectral centroid.

# Feature Extraction: Chroma 12 Pitch Scale 

In music, the term chroma feature or chromagram closely relates to the twelve different pitch classes.

The underlying observation is that humans perceive two musical pitches as similar in color if they differ by an octave. Based on this observation, a pitch can be separated into two components, which are referred to as tone height and chroma. Assuming the equal-tempered scale, one considers twelve chroma values represented by the set

{C, C♯, D, D♯, E , F, F♯, G, G♯, A, A♯, B}

The Marmar paper has extensively used this particular feature, so I too extracted it to check how Chroma FFT values differ accross audio files of different SUD values.

In [52]:
# Helper Functions

'''
Creating Helper Function that calculates mean of the chromagram. 
    Param1: Path to audio file
    Param2: Coefficient Number
'''

def chroma_mean(audio, coefficient_number):
    chromagram = librosa.feature.chroma_stft(audio, sr=sample_rate, hop_length=512)
    return np.mean(chromagram[coefficient_number], dtype = np.float32)

'''
Creating Helper Function that calculates variance of the chromagram. 
    Param1: Path to audio file
    Param2: Coefficient Number
'''

def chroma_variance(audio, coefficient_number):
    chromagram = librosa.feature.chroma_stft(audio, sr=sample_rate, hop_length=512)
    return np.var(chromagram[coefficient_number], dtype = np.float32)


'''
Creating Helper Function that calculates kurtosis of the chromagram. 
    Param1: Path to audio file
    Param2: Coefficient Number
'''

def chroma_kurtosis(audio, coefficient_number):
    chromagram = librosa.feature.chroma_stft(audio, sr=sample_rate, hop_length=512)
    return kurtosis(chromagram[coefficient_number])

In [53]:
# For ED Audio
print("Mean of first Chroma FFT for ED Audio: "+str(chroma_mean(ed_audio, 1)))
print("Variance of first Chroma FFT for FLW-1 Audio: "+str(chroma_variance(ed_audio, 1))+"\n")

# For FLW-1 Audio
print("Mean of first Chroma FFT for FLW-1 Audio: "+str(chroma_mean(flw_1_audio, 1)))
print("Variance of first Chroma FFT for FLW-1 Audio: "+str(chroma_variance(flw_1_audio, 1))+"\n")

# For FLW-2 Audio
print("Mean of first Chroma FFT for FLW-2 Audio: "+str(chroma_mean(flw_2_audio, 1)))
print("Variance of first Chroma FFT for FLW-2 Audio: "+str(chroma_variance(flw_2_audio, 1)))

Mean of first Chroma FFT for ED Audio: 0.50464016
Variance of first Chroma FFT for FLW-1 Audio: 0.09402144

Mean of first Chroma FFT for FLW-1 Audio: 0.49902692
Variance of first Chroma FFT for FLW-1 Audio: 0.08909191

Mean of first Chroma FFT for FLW-2 Audio: 0.48722297
Variance of first Chroma FFT for FLW-2 Audio: 0.09145144


## We can do similar analysis for all 12 Chroma coefficients

In the above example, we have passed coeeficient as 1. We can do the same analysis for coefficents 0-11. For the first coefficient, we see that mean decreases as SUD decreases.

# Feature Extraction: RMS Energy

The energy of a signal corresponds to the total magntiude of the signal. For audio signals, that roughly corresponds to how loud the signal is. The RMS energy is just the root mean square of that energy.

In [54]:
# Helper Functions
'''
Creating Helper Function that calculates mean in RMSE values 
    Param1: Audio
'''
def rmse_mean(audio):
    hop_length = 256
    frame_length = 512

    rmse = librosa.feature.rms(audio, frame_length=frame_length, hop_length=hop_length, center=True)
    rmse = rmse[0]

    return np.mean(rmse, dtype = np.float32)   
    
'''
Creating Helper Function that calculates variance in RMSE values 
    Param1: Audio
'''
def rmse_variance(audio):
    hop_length = 256
    frame_length = 512

    rmse = librosa.feature.rms(audio, frame_length=frame_length, hop_length=hop_length, center=True)
    rmse = rmse[0]

    return np.var(rmse, dtype = np.float32)    

In [55]:
# For ED Audio
print("Mean RMSE for ED audio: "+str(rmse_mean(ed_audio)))
print("RMS Variance for ED audio: "+str(rmse_variance(ed_audio))+"\n")

# For FLW-1 Audio
print("Mean RMSE for FLW-1 audio: "+str(rmse_mean(flw_1_audio)))
print("RMS Variance for FLW-1 audio: "+str(rmse_variance(flw_1_audio))+"\n")

# For FLW-2 Audio
print("Mean RMSE for FLW-2 audio: "+str(rmse_mean(flw_2_audio)))
print("RMS Variance for FLW-2 audio: "+str(rmse_variance(flw_2_audio)))

Mean RMSE for ED audio: 0.0101291025
RMS Variance for ED audio: 9.459419e-05

Mean RMSE for FLW-1 audio: 0.008702356
RMS Variance for FLW-1 audio: 4.6746158e-05

Mean RMSE for FLW-2 audio: 0.0105046835
RMS Variance for FLW-2 audio: 9.3368464e-05


## Conclusion for RMS Energy:

It's difficult to point out any observations as such, since we see no general trend here. Maybe analyzing the audio of another session might help to discover some trend. 

Again, according to my intuition, RMS Energy's variance should DECREASE with higher SUDs. 

# Feature Extraction: Spectral Rolloff

Spectral rolloff is the frequency below which a specified percentage of the total spectral energy, e.g. 85%, lies.

Neither discussed in Marmar or Wiegersma papers, but I was curious to see how they vary for different SUD values.

In [57]:
# Helper Function
'''
Creating Helper Function that calculates mean of the calculated rolloffs. 
    Param1: Audio
'''

def rolloff_mean(audio):
    spectral_rolloff = librosa.feature.spectral_rolloff(audio+0.01, sr=sample_rate)[0]
    return np.mean(spectral_rolloff, dtype = np.float32)

'''
Creating Helper Function that calculates variance of the calculated rolloffs. 
    Param1: Audio
'''

def rolloff_variance(audio):
    spectral_rolloff = librosa.feature.spectral_rolloff(audio+0.01, sr=sample_rate)[0]
    return np.var(spectral_rolloff, dtype = np.float32)

In [58]:
# For ED Audio
print("Mean Rolloff for ED audio: "+str(rolloff_mean(ed_audio)))
print("Rolloff Variance for ED audio: "+str(rolloff_variance(ed_audio))+"\n")

# For FLW-1 Audio
print("Mean Rolloff for FLW-1 audio: "+str(rolloff_mean(flw_1_audio)))
print("Rolloff Variance for FLW-1 audio: "+str(rolloff_variance(flw_1_audio))+"\n")

# For FLW-2 Audio
print("Mean Rolloff for FLW-2 audio: "+str(rolloff_mean(flw_2_audio)))
print("Rolloff Variance for FLW-2 audio: "+str(rolloff_variance(flw_2_audio)))

Mean Rolloff for ED audio: 4112.696
Rolloff Variance for ED audio: 2892565.5

Mean Rolloff for FLW-1 audio: 3774.2007
Rolloff Variance for FLW-1 audio: 2456047.5

Mean Rolloff for FLW-2 audio: 3939.07
Rolloff Variance for FLW-2 audio: 2738746.5


## Conclusion for Spectral Rolloff (SR):

It's difficult to point out any observations as such, since we see no general trend here. Maybe analyzing the audio of another session might help to discover some trend. 

Again, according to my intuition, Rolloff variance should DECREASE with higher SUDs. 

# Feature Extraction: Prosodic Features

Prosody refers to a collection of acoustic features that concern intonation-related (pitch), loudness-related
(intensity),and tempo-related(e.g. durational aspects, speaking rate) features. This can closely contribute to meaning and may reveal information normally not captured by textual features, such as emotional state or attitude.

The Wiegersma paper has used this particular feature for a certain part of their experiment, so I too extracted it to check how prosodic features values differ accross audio files of different SUD values.

In [59]:
# Helper Functions
'''
Creating Helper Function that calculates Prosodic Features for given audio file. 
    Param1: Path to audio file
'''
def find_prosodic_features(path):
    prosodic_features = {
        "phonation_rate" : 0,
        "speech_productivity" : 0,
        "speech_rate" : 0,
        "articulation_rate" : 0
    }
    
    audio_for_prosody, sample_rate = librosa.load(path, duration=audio_duration) 
    
    # Finding voiced intervals by removing silent parts of the audio
    voiced_intervals = librosa.effects.split(y=audio_for_prosody, top_db=20)
    total_voiced_duration = 0
    for interval in voiced_intervals:
        total_voiced_duration = total_voiced_duration + ((interval[1]-interval[0])/sample_rate)
    
    # To account for overflows
    if total_voiced_duration > audio_duration:
        total_voiced_duration = audio_duration
        
    total_silenced_duration = audio_duration-total_voiced_duration

    prosodic_features["phonation_rate"] = total_voiced_duration/audio_duration
    prosodic_features["speech_productivity"] = (total_silenced_duration)/total_voiced_duration

    # Reading Audio file as source
    # listening the audio file and store in audio_text variable
    r = sr.Recognizer()
    demo = sr.AudioFile(path)
    with demo as source:

        audio = r.record(source, duration=50)

        # recognize_() method will throw a request error if the API is unreachable, hence using exception handling
        try:

            # using google speech recognition
            text = r.recognize_google(audio)
            num_words_spoken = len(text.split())
            prosodic_features["speech_rate"] = num_words_spoken/audio_duration
            prosodic_features["articulation_rate"] = num_words_spoken/total_voiced_duration

        except:
             print('Error in Calculating speech rate and articulation rate')
    
    return prosodic_features

In [60]:
# For ED Audio
min_sud_prosodic_features = find_prosodic_features(filename3)
print("Prosodic Features for ED Audio: \n"+str(min_sud_prosodic_features)+"\n")

# For FLW-1 Audio
medium_sud_prosodic_features = find_prosodic_features(filename2)
print("Prosodic Features for FLW-1 Audio: \n"+str(medium_sud_prosodic_features)+"\n")

# For FLW-2 Audio
high_sud_prosodic_features = find_prosodic_features(filename1)
print("Prosodic Features for FLW-2 Audio: \n"+str(high_sud_prosodic_features)+"\n")

Prosodic Features for ED Audio: 
{'phonation_rate': 0.39992249564037957, 'speech_productivity': 1.5004844961240322, 'speech_rate': 0.03337834237550862, 'articulation_rate': 0.08346202761627912}

Prosodic Features for FLW-1 Audio: 
{'phonation_rate': 0.6039527223406315, 'speech_productivity': 0.6557587423804947, 'speech_rate': 0.05006751356326294, 'articulation_rate': 0.0828997232916266}

Prosodic Features for FLW-2 Audio: 
{'phonation_rate': 0.4826583995349738, 'speech_productivity': 1.0718586912886392, 'speech_rate': 0.08344585593877156, 'articulation_rate': 0.17288802187876356}



## Conclusion for Prosodic Features:

These are features that I got to know about from the Wiegersma paper, and are really helpful for understanding the emotional state of the patient.

1. Phonation Rate: N(voiced minutes)/N(total minutes). The above observation shows that, for higher SUD values, the phonation rate drops, i.e, the patient is not able to speak much during higher SUDs.

2. Speech Productivity: N(silent minutes)/N(voiced minutes). The above observation shows that silent segments increase as SUD value increases.

3. Speech Rate: Words per minute. Speech rate decreased significantly for higher SUDs.

4. Articulation Rate: Words per voiced minute. Articulation rate decreased significantly for higher SUDs.

Indeed, these features give a good amount of insights into the emotional state of the patient!