Recognizing robot failure by analyzing human reactions and behaviors toward in-person robot failures.
The experiment involved a participant and a robot interacting and conversing with each other in a private room. The robot was controlled by a researcher and engineered to produce at least 3 errors.
The robot failure: not understanding the participant's order.
The robot failure was verbalized as "Sorry, I do not understand" and occurred at least 3 times. Afterward, the interaction ended with the robot verbalizing "OK, I will call the researcher".
Feature extraction was performed on the participant to understand and analyze facial expressions, body movements, and speech that might convey underlying emotions during the human-robot interaction. After each feature extraction tool was applied to each participant's video, the resulting outputs were processed into readable forms, which were then merged into one collective CSV file documenting all feature data per frame per participant.
The OpenFace toolkit was used to detect facial landmarks and action units and required a video file path as input. The output consisted of a CSV file containing feature data per frame for the video.
The OpenFace toolkit can be found here: https://github.com/TadasBaltrusaitis/OpenFace
Irrelevant features were excluded from the OpenFace output; this exclusion was performed while merging all feature data. Facial feature exclusion is implemented in this Python script: feature_merge.py. For the facial feature exclusion portion of the script, the required inputs are the CSV file path of each participant's facial features and the final facial feature list.
Final facial feature list (action unit intensity and presence features):
facial_features = ['AU01_r', 'AU02_r', 'AU04_r', 'AU05_r', 'AU06_r', 'AU07_r', 'AU09_r', 'AU10_r', 'AU12_r',
'AU14_r', 'AU15_r', 'AU17_r', 'AU20_r', 'AU23_r', 'AU25_r', 'AU26_r', 'AU45_r', 'AU01_c', 'AU02_c', 'AU04_c',
'AU05_c', 'AU06_c', 'AU07_c', 'AU09_c', 'AU10_c', 'AU12_c', 'AU14_c', 'AU15_c', 'AU17_c', 'AU20_c', 'AU23_c',
'AU25_c', 'AU26_c', 'AU28_c', 'AU45_c']
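For reference, a minimal sketch of this column filtering, assuming pandas, hypothetical file paths, and that the kept columns are exactly the action unit columns listed above (feature_merge.py remains the authoritative implementation):
import pandas as pd

# Hypothetical paths; replace with one participant's OpenFace CSV and the desired output path.
openface_csv = "participant_02_openface.csv"
output_csv = "participant_02_facial_features.csv"

df = pd.read_csv(openface_csv)
# OpenFace column names can carry leading spaces; strip them before selecting.
df.columns = df.columns.str.strip()

# Keep the frame index plus the action unit columns, matching the final facial feature list above.
keep = ["frame"] + [c for c in df.columns if c.startswith("AU")]
df[keep].to_csv(output_csv, index=False)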
The OpenPose toolkit and the BODY_25 model were used to obtain keypoints of the participant's body and required a video file path as input and a JSON file path to store the output. The JSON files were parsed and converted into CSV files listing pose features per frame for each video file with this script: parse_openpose.py.
The following was executed in the command line interface to return a JSON file of 25 keypoints per frame:
bin\OpenPoseDemo.exe --video {input_video_file_path} --write_video {output_file_path} --write_json {output_file_path}
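For reference, a minimal sketch of the JSON parsing performed by parse_openpose.py, assuming the standard OpenPose output layout (one *_keypoints.json file per frame, with pose_keypoints_2d holding x, y, confidence triplets for the 25 BODY_25 keypoints); the actual script maps keypoint indices to the named columns listed below, while this sketch uses generic indices:
import glob
import json
import os

import pandas as pd

# Hypothetical paths; OpenPose writes one JSON file per frame into the --write_json directory.
json_dir = "participant_02_openpose_json"
output_csv = "participant_02_pose.csv"

rows = []
for frame_idx, path in enumerate(sorted(glob.glob(os.path.join(json_dir, "*_keypoints.json")))):
    with open(path) as f:
        data = json.load(f)
    if not data["people"]:
        continue  # no person detected in this frame
    keypoints = data["people"][0]["pose_keypoints_2d"]  # 25 keypoints as (x, y, confidence) triplets
    row = {"frame": frame_idx}
    for i in range(25):
        row[f"keypoint{i}_x"], row[f"keypoint{i}_y"] = keypoints[3 * i], keypoints[3 * i + 1]
    rows.append(row)

pd.DataFrame(rows).to_csv(output_csv, index=False)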
Only upper body keypoints were relevant to the video dataset, so the lower body keypoints (mid hip, right hip, left hip, right knee, left knee, right ankle, left ankle, right big toe, left big toe, right small toe, left small toe, right heel, left heel) were removed during preprocessing. The Python script used to exclude lower body features is found here: feature_exclusion.py. Required inputs for the script include the path to the directory holding all participant CSV files and, for each participant, a CSV file path to store the output.
Final pose feature list:
pose_features = ['nose_x', 'nose_y', 'neck_x', 'neck_y', 'rightshoulder_x', 'rightshoulder_y',
'leftshoulder_x', 'leftshoulder_y', 'rightelbow_x', 'rightelbow_y', 'leftelbow_x', 'leftelbow_y',
'rightwrist_x', 'rightwrist_y', 'leftwrist_x', 'leftwrist_y', 'righteye_x', 'righteye_y',
'lefteye_x', 'lefteye_y', 'rightear_x', 'rightear_y', 'leftear_x', 'leftear_y']
The OpenPose toolkit can be found here: https://github.com/CMU-Perceptual-Computing-Lab/openpose
In addition to the original features produced with OpenPose, a column titled "[original feature name]_delta" was appended for each original feature, containing the change from the previous frame's value to the current frame's value. The Python script used for updating the CSV with delta values is found here: features_delta_calculations.py. Required inputs to obtain delta calculations for each participant include the CSV file path of the participant's pose features and a CSV file path to store the output with the additional delta columns and values.
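A minimal sketch of the delta computation, assuming pandas and hypothetical file paths (features_delta_calculations.py remains the authoritative implementation):
import pandas as pd

# Hypothetical paths; replace with one participant's pose feature CSV and the desired output path.
pose_csv = "participant_02_pose_features.csv"
output_csv = "participant_02_pose_features_delta.csv"

df = pd.read_csv(pose_csv)
original_cols = [c for c in df.columns if c != "frame"]
for col in original_cols:
    # Change from the previous frame's value to the current frame's value
    # (the first frame has no predecessor; 0 is assumed here).
    df[f"{col}_delta"] = df[col].diff().fillna(0)

df.to_csv(output_csv, index=False)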
The openSMILE toolkit and the eGeMAPSv02 configuration were used to obtain audio features from the interaction between the participant and the robot and required an audio file path as input and a CSV file path as output. The following Python code was executed to return a CSV file of audio features:
import opensmile
import os  # retained from the full script, which processes a directory of audio files

smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.eGeMAPSv02,
    feature_level=opensmile.FeatureLevel.LowLevelDescriptors,
)

# Extract low-level descriptors for one audio file and write them to CSV.
file = f"{input_audio_file_path}"
y = smile.process_file(file)
y.to_csv(f"{output_file_path}")
The script can also be found here: audio_extraction.py. Required inputs for the script included a directory of all participant audio files and a CSV file path for each participant to store output.
Final audio feature list:
audio_features = ['Loudness_sma3', 'alphaRatio_sma3', 'hammarbergIndex_sma3', 'slope0-500_sma3', 'slope500-1500_sma3',
'spectralFlux_sma3', 'mfcc1_sma3', 'mfcc2_sma3', 'mfcc3_sma3', 'mfcc4_sma3', 'F0semitoneFrom27.5Hz_sma3nz',
'jitterLocal_sma3nz', 'shimmerLocaldB_sma3nz', 'HNRdBACF_sma3nz', 'logRelF0-H1-H2_sma3nz', 'logRelF0-H1-A3_sma3nz',
'F1frequency_sma3nz', 'F1bandwidth_sma3nz', 'F1amplitudeLogRelF0_sma3nz', 'F2frequency_sma3nz', 'F2bandwidth_sma3nz',
'F2amplitudeLogRelF0_sma3nz', 'F3frequency_sma3nz', 'F3bandwidth_sma3nz', 'F3amplitudeLogRelF0_sma3nz']
Only audio segments (partitioned via speaker diarization) spoken by the participant were relevant to the study; other segments were removed during preprocessing.
The openSMILE toolkit can be found here: https://audeering.github.io/opensmile/about.html or https://github.com/audeering/opensmile/
Speaker diarization was used to identify timestamps indicating when the participant and the robot were speaking.
The pyannote speaker diarization toolkit was used to extract these timestamps. The following Python code was executed to return an RTTM file of speaker timestamps from an audio file:
from pyannote.audio import Pipeline
import torch  # available for moving the pipeline to a GPU if desired

# Load the pretrained diarization pipeline (requires a Hugging Face access token).
pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token=f"{huggingface_authentication_token}")

# Run diarization on one audio file, print each speaker turn, and write all turns to an RTTM file.
diarization = pipeline(f"{input_audio_file_path}")
for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"start={turn.start:.1f}s stop={turn.end:.1f}s speaker_{speaker}")
with open(f"{output_file_path}", "w") as rttm:
    diarization.write_rttm(rttm)
The output of the speaker diarization was an RTTM file, which was converted into a CSV file and intersected with the openSMILE audio extraction CSV outputs to remove the timestamps at which other speakers were speaking.
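A minimal sketch of the RTTM-to-CSV conversion, assuming the standard RTTM layout (space-separated fields with the start time and duration in seconds in the fourth and fifth fields and the speaker label in the eighth) and hypothetical paths:
import pandas as pd

# Hypothetical paths for one participant's diarization output.
rttm_path = "participant_02.rttm"
output_csv = "participant_02_diarization.csv"

segments = []
with open(rttm_path) as f:
    for line in f:
        fields = line.split()
        if not fields:
            continue
        start, duration, speaker = float(fields[3]), float(fields[4]), fields[7]
        segments.append({"start": start, "end": start + duration, "speaker": speaker})

pd.DataFrame(segments).to_csv(output_csv, index=False)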
The pyannote speaker diarization toolkit can be found here: https://huggingface.co/pyannote/speaker-diarization-3.1 or https://github.com/pyannote/pyannote-audio
Once the timestamps for each speaker were extracted, audio features corresponding to the participant's speech were retained, while those corresponding to other speakers' speech were removed. The Python script used to achieve this is found here: filter_audio_features.py. Required inputs for the script to filter features for a single participant include the CSV file of openSMILE audio extracted features, the CSV file of timestamps produced from speaker diarization, and a CSV file path to store the output.
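A minimal sketch of this filtering, assuming hypothetical paths, that the diarization CSV has start, end, and speaker columns (as in the conversion sketch above), and that the participant's speaker label is known (filter_audio_features.py remains the authoritative implementation):
import pandas as pd

# Hypothetical paths and speaker label.
features_csv = "participant_02_audio_features.csv"
diarization_csv = "participant_02_diarization.csv"
output_csv = "participant_02_audio_features_filtered.csv"
participant_speaker = "SPEAKER_00"

features = pd.read_csv(features_csv)
segments = pd.read_csv(diarization_csv)
segments = segments[segments["speaker"] == participant_speaker]

# openSMILE start timestamps look like "0 days 00:00:00.020000"; convert them to seconds.
start_s = pd.to_timedelta(features["start"]).dt.total_seconds()

# Keep a row only if its start time falls inside one of the participant's speech segments.
mask = pd.Series(False, index=features.index)
for _, seg in segments.iterrows():
    mask |= (start_s >= seg["start"]) & (start_s <= seg["end"])

features[mask].to_csv(output_csv, index=False)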
The CSV files of features obtained from openSMILE used start and end timestamps in the format "0 days 00:00:00.00", and each row contained audio features for a 0.02-second window (between the start and end timestamps). However, this study uses frames rather than timestamps, so start timestamps were converted into frame numbers. When multiple rows mapped to the same frame number, each feature's values were averaged and used as that frame's feature value. The Python script used to convert timestamps to frames and process features is found here: audio_features_frames.py. Required inputs to obtain frames included a directory of all participant CSV files of filtered audio features and, for each participant, a CSV file path to store the new output CSV.
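A minimal sketch of the frame conversion and averaging, assuming a hypothetical frame rate of 30 fps (the recordings' actual rate applies), hypothetical paths, and the openSMILE column names used above (audio_features_frames.py remains the authoritative implementation):
import pandas as pd

# Hypothetical paths and frame rate.
filtered_csv = "participant_02_audio_features_filtered.csv"
output_csv = "participant_02_audio_features_frames.csv"
fps = 30  # assumption: must match the frame rate used for the facial and pose features

df = pd.read_csv(filtered_csv)

# Convert "0 days 00:00:00.02"-style start timestamps into frame numbers.
df["frame"] = (pd.to_timedelta(df["start"]).dt.total_seconds() * fps).astype(int)

# Average all rows that map to the same frame number.
feature_cols = [c for c in df.columns if c not in ("file", "start", "end", "frame")]
df = df.groupby("frame", as_index=False)[feature_cols].mean()

df.to_csv(output_csv, index=False)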
After facial, pose, and audio features were processed to include relevant features per frame per participant, all features were merged into one CSV file representing the features per frame of a single participant. The Python script for merging features for each participant is found here: feature_merge.py
All participants' features were merged into a collective CSV file containing all rows from each participant's merged feature data. The Python script for merging all participant feature data is found here: feature_all_participants.py.
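A minimal sketch of both merge steps, assuming per-participant facial, pose, and audio CSVs that share a frame column and hypothetical file names; the participant column added here is an assumption about how rows are distinguished in the collective file:
import pandas as pd

# Hypothetical per-participant inputs that share a "frame" column.
facial = pd.read_csv("participant_02_facial_features.csv")
pose = pd.read_csv("participant_02_pose_features_delta.csv")
audio = pd.read_csv("participant_02_audio_features_frames.csv")

# Merge the three modalities on frame number (the per-participant step, feature_merge.py).
merged = facial.merge(pose, on="frame").merge(audio, on="frame")
merged.insert(0, "participant", 2)
merged.to_csv("participant_02_merged.csv", index=False)

# Stack every participant's merged CSV into one collective file (feature_all_participants.py).
participant_files = ["participant_02_merged.csv"]  # plus the remaining participants' files
all_features = pd.concat([pd.read_csv(p) for p in participant_files], ignore_index=True)
all_features.to_csv("all_participants_features.csv", index=False)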
Features were checked for NaN and inf values and then normalized for model training. The Python script for checking feature values is found here: check_features.py, and the script for normalizing features is found here: normalization.py.
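A minimal sketch of the value check and normalization, assuming the collective CSV from the sketch above; the choice of min-max scaling here is an assumption, and check_features.py and normalization.py remain the authoritative implementations:
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.read_csv("all_participants_features.csv")  # hypothetical path
meta_cols = ["participant", "frame"]               # assumption: non-feature columns
feature_cols = [c for c in df.columns if c not in meta_cols]

# Check for NaN and inf values (check_features.py).
has_bad_values = df[feature_cols].isna().any().any() or np.isinf(df[feature_cols].to_numpy()).any()
print("NaN or inf present:", has_bad_values)

# Normalize the features (normalization.py); assumes the check above passed.
df[feature_cols] = MinMaxScaler().fit_transform(df[feature_cols])
df.to_csv("all_participants_features_normalized.csv", index=False)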
Two types of labeling methods were used: binary labeling and multiclass labeling. A label was applied to each frame of the video of the interaction between the participant and the robot. Labels were initially appended to the pose feature CSV files.
The Python script used for efficient labeling is found here: features_create_labels_csvs.py. Required inputs for the script to store binary and multiclass labels for each participant included a CSV file to store the features and labels, the frame numbers for when binary labels should be added, and the frame numbers for when multiclass labels should be added. Additionally, a directory path was required as the location to store the CSV files of features and labels for each participant. A labeling sketch is provided after the label definitions below.
- "0" labeled frames from the beginning of the video to the first "Sorry, I do not understand" error and after "OK, I will call the researcher" to the end of the video.
- "1" labeled frames from the first "Sorry, I do not understand" error to "OK, I will call the researcher".
- "0" labeled frames from the beginning of the video to the first "Sorry, I do not understand" error and after "OK, I will call the researcher" to the end of the video.
- "1" labeled frames from the first "Sorry, I do not understand" error to the second "Sorry, I do not understand".
- "2" labeled frames from the second "Sorry, I do not understand" error to the third "Sorry, I do not understand".
- "3" labeled frames from the third "Sorry, I do not understand" error to "OK, I will call the researcher".
The same labeling procedure was used for additional errors.
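A minimal sketch of how the binary and multiclass labels could be assigned from the error frame numbers, assuming hypothetical frame boundaries and column names (features_create_labels_csvs.py remains the authoritative implementation):
import pandas as pd

df = pd.read_csv("participant_02_pose_features.csv")  # hypothetical path

# Hypothetical frame numbers for the three "Sorry, I do not understand" errors
# and for "OK, I will call the researcher".
error_frames = [300, 500, 700]
call_researcher_frame = 900

# Binary labels: "1" between the first error and the call to the researcher, "0" elsewhere.
df["binary_label"] = ((df["frame"] >= error_frames[0]) &
                      (df["frame"] < call_researcher_frame)).astype(int)

# Multiclass labels: "0" outside the failure window, then "1", "2", "3", ... per error segment.
boundaries = error_frames + [call_researcher_frame]
df["multiclass_label"] = 0
for i in range(len(error_frames)):
    in_segment = (df["frame"] >= boundaries[i]) & (df["frame"] < boundaries[i + 1])
    df.loc[in_segment, "multiclass_label"] = i + 1

df.to_csv("participant_02_features_labels.csv", index=False)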
The following tables contain the counts and percentages of binary and multiclass labels per participant.
participant | Label "0" Count | Label "0" Percentage | Label "1" Count | Label "1" Percentage | Total Number of Labels |
---|---|---|---|---|---|
2 | 569 | 48.9% | 595 | 51.1% | 1164 |
4 | 601 | 61.2% | 381 | 38.8% | 982 |
5 | 575 | 49.6% | 584 | 50.4% | 1159 |
6 | 619 | 51.9% | 573 | 48.1% | 1192 |
7 | 609 | 53.8% | 524 | 46.2% | 1133 |
8 | 606 | 59.8% | 408 | 40.2% | 1014 |
9 | 582 | 55.8% | 461 | 44.2% | 1043 |
10 | 589 | 48.3% | 630 | 51.7% | 1219 |
11 | 588 | 55.2% | 477 | 44.8% | 1065 |
12 | 615 | 45.8% | 728 | 54.2% | 1343 |
14 | 571 | 54.9% | 469 | 45.1% | 1040 |
16 | 577 | 43.0% | 764 | 57.0% | 1341 |
17 | 585 | 51.7% | 546 | 48.3% | 1131 |
18 | 577 | 45.1% | 703 | 54.9% | 1280 |
19 | 579 | 49.8% | 583 | 50.2% | 1162 |
20 | 595 | 44.4% | 746 | 55.6% | 1341 |
21 | 583 | 49.0% | 607 | 51.0% | 1190 |
22 | 571 | 52.1% | 526 | 47.9% | 1097 |
23 | 596 | 50.1% | 593 | 49.9% | 1189 |
25 | 577 | 50.9% | 556 | 49.1% | 1133 |
26 | 519 | 45.1% | 631 | 54.9% | 1150 |
27 | 608 | 39.2% | 942 | 60.8% | 1550 |
28 | 600 | 53.1% | 531 | 46.9% | 1131 |
29 | 517 | 32.6% | 1069 | 67.4% | 1586 |
- Average percentage of "0" labels: 49.6%
- Average percentage of "1" labels: 50.4%
participant | Label "0" Count | Label "0" Percentage | Label "1" Count | Label "1" Percentage | Label "2" Count | Label "2" Percentage | Label "3" Count | Label "3" Percentage | Total Number of Labels |
---|---|---|---|---|---|---|---|---|---|
2 | 569 | 48.9% | 197 | 16.9% | 202 | 17.4% | 196 | 16.8% | 1164 |
4 | 601 | 61.2% | 132 | 13.4% | 129 | 13.1% | 120 | 12.2% | 982 |
5 | 575 | 49.6% | 221 | 19.1% | 130 | 11.2% | 233 | 20.1% | 1159 |
6 | 619 | 51.9% | 166 | 13.9% | 147 | 12.3% | 260 | 21.8% | 1192 |
7 | 609 | 53.8% | 159 | 14.0% | 194 | 17.1% | 171 | 15.1% | 1133 |
8 | 606 | 59.8% | 143 | 14.1% | 136 | 13.4% | 129 | 12.7% | 1014 |
9 | 582 | 55.8% | 136 | 13.0% | 164 | 15.7% | 161 | 15.4% | 1043 |
10 | 589 | 48.3% | 166 | 13.6% | 275 | 22.6% | 189 | 15.5% | 1219 |
11 | 588 | 55.2% | 140 | 13.1% | 206 | 19.3% | 131 | 12.3% | 1065 |
12 | 615 | 45.8% | 166 | 12.4% | 165 | 12.3% | 397 | 29.6% | 1343 |
14 | 571 | 54.9% | 147 | 14.1% | 160 | 15.4% | 162 | 15.6% | 1040 |
16 | 577 | 43.0% | 165 | 12.3% | 263 | 19.6% | 336 | 25.1% | 1341 |
17 | 585 | 51.7% | 155 | 13.7% | 195 | 17.2% | 196 | 17.3% | 1131 |
18 | 577 | 45.1% | 137 | 10.7% | 187 | 14.6% | 379 | 29.6% | 1280 |
19 | 579 | 49.8% | 145 | 12.5% | 212 | 18.2% | 226 | 19.4% | 1162 |
20 | 595 | 44.4% | 168 | 12.5% | 276 | 20.6% | 302 | 22.5% | 1341 |
21 | 583 | 49.0% | 156 | 13.1% | 205 | 17.2% | 246 | 20.7% | 1190 |
22 | 571 | 52.1% | 158 | 14.4% | 162 | 14.8% | 206 | 18.8% | 1097 |
23 | 596 | 50.1% | 130 | 10.9% | 341 | 28.7% | 122 | 10.3% | 1189 |
25 | 577 | 50.9% | 137 | 12.1% | 218 | 19.2% | 201 | 17.7% | 1133 |
26 | 519 | 45.1% | 170 | 14.8% | 193 | 16.8% | 268 | 23.3% | 1150 |
27 | 608 | 39.2% | 154 | 9.9% | 298 | 19.2% | 490 | 31.6% | 1550 |
28 | 600 | 53.1% | 161 | 14.2% | 234 | 20.7% | 136 | 12.0% | 1131 |
Participant 29's interaction consisted of 5 errors. Therefore, it contained 2 additional labels "4" and "5".
participant | Label "0" Count | Label "0" Percentage | Label "1" Count | Label "1" Percentage | Label "2" Count | Label "2" Percentage | Label "3" Count | Label "3" Percentage | Label "4" Count | Label "4" Percentage | Label "5" Count | Label "5" Percentage | Total Number of Labels |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
29 | 511 | 32.2% | 233 | 14.7% | 167 | 10.5% | 191 | 12.0% | 223 | 14.1% | 261 | 16.5% | 1586 |
- Average percentage of "0" label: 49.5%
- Average percentage of "1" label: 12.7%
- Average percentage of "2" label: 16.6%
- Average percentage of "3" label: 19.0%
- Average percentage of additional labels: 2.2%
The Python script that assisted with the label analysis is found here: label_analysis.py
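A minimal sketch of the kind of count and percentage summary reported above, assuming the label column names from the labeling sketch earlier (label_analysis.py remains the authoritative implementation):
import pandas as pd

df = pd.read_csv("participant_02_features_labels.csv")  # hypothetical path

counts = df["binary_label"].value_counts().sort_index()
percentages = (counts / counts.sum() * 100).round(1)

summary = pd.DataFrame({"count": counts, "percentage": percentages})
summary["total"] = counts.sum()
print(summary)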
Participants were excluded based on the following reasons:
- Failed protocol resulting in no reaction to failures
- Distractions not involved in the experiment resulting in no reaction to failures
- Feature extraction compound confidence scores below 0.50.
Final number of participants: 24.
Principal component analysis (PCA) is a method used to reduce the number of variables in a large dataset while retaining the dominant patterns in the data. PCA was conducted on the dataset of 84 features containing facial, pose, and audio features. The short script below was used to retain 90% of the variance and apply PCA to the dataset:
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# df holds the collective features-and-labels data loaded earlier in the script;
# the first four columns contain participant, frame, and label information.
participant_frames_labels = df.iloc[:, :4]
x = df.iloc[:, 4:]
x = StandardScaler().fit_transform(x.values)

# Fit PCA on all components to inspect the full dimensionality.
pca = PCA()
principal_components = pca.fit_transform(x)
print(principal_components.shape)

# Refit, keeping enough components to explain 90% of the variance.
pca = PCA(n_components=0.90)
principal_components = pca.fit_transform(x)
print(principal_components.shape)
principal_df = pd.DataFrame(data=principal_components, columns=['principal component ' + str(i) for i in range(principal_components.shape[1])])
principal_df = pd.concat([participant_frames_labels, principal_df], axis=1)
The script was embedded into the create_data_splits_pca method in create_data_splits.py.
The resulting dataframe consisted of 41 principal components.
Running PCA on pose, facial, and audio features separately yielded 13, 24, and 7 principal components, respectively.