# Checking IBM Watson ASR differences between the Vive headset microphone and a radio mic

Here's the original text:

*We present an immersive multi-person game developed for testing models of non-verbal behaviour in conversation. People interact in a virtual environment using avatars that are driven, by default, by their real-time head and hand movements. However, on the press of a button each participant's real movements can be substituted by "fake" avatar movements generated by algorithms. The object of the game is to score points in two ways a) by faking without being detected and b) by detecting when others are faking. This enables what amounts to a non-verbal Turing test in which the effectiveness of different algorithms for controlling non-verbal behaviour can be directly tested and evaluated in live interaction.*

## Audio files:

`edited/headset_mic.wav`

<audio controls src="edited/headset_mic.wav"></audio>

`edited/radio_mic.wav`

<audio controls src="edited/radio_mic.wav"></audio>

In [1]:
import concurrent.futures
import difflib
import functools
import os
import time

import deep_disfluency.asr.ibm_watson
import fluteline
import watson_streaming
import watson_streaming.utilities

In [2]:
CREDENTIALS_PATH = 'credentials.json'
WATSON_SETTINGS = {  # Copied from deep_disfluency_server
    'inactivity_timeout': -1,  # Don't kill me after 30 seconds
    'interim_results': True,
    'timestamps': True
}
AUDIO_DIR = 'edited'
TARGET = '''
We present an immersive multi-person game developed for testing models
of non-verbal behaviour in conversation. People interact in a virtual
environment using avatars that are driven, by default, by their real-time
head and hand movements. However, on the press of a button each
participant's real movements can be substituted by "fake" avatar
movements generated by algorithms. The object of the game is to score
points in two ways a) by faking without being detected and b) by
detecting when others are faking. This enables what amounts to a
non-verbal Turing test in which the effectiveness of different algorithms
for controlling non-verbal behaviour can be directly tested and evaluated
in live interaction.
'''
TARGET = [''.join([c for c in word if c.isalnum()]) for word in TARGET.split()]
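
The cleaning step above keeps letter case and glues hyphenated compounds together, whereas the Watson output in the diffs further down is mostly lowercase and splits such compounds ("non", "verbal"). This probably inflates the distances measured against `TARGET`. A minimal sketch of a stricter normalisation follows; `normalise` and `sample` are hypothetical names, not part of the notebook, and the transcripts would need the same lowercasing for the comparison to be fair.

```python
import re

def normalise(text):
    """Lowercase, then keep runs of letters, digits and apostrophes as words."""
    return re.findall(r"[a-z0-9']+", text.lower())

sample = 'People interact in real-time. This is a non-verbal "Turing test".'

# The notebook's approach: strip non-alphanumeric characters from each token.
original = [''.join(c for c in word if c.isalnum()) for word in sample.split()]

print(original)           # keeps 'People' and produces 'realtime', 'nonverbal'
print(normalise(sample))  # lowercases and splits 'real-time', 'non-verbal'
```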

In [3]:
def queue_generator(queue):
    while not queue.empty():
        yield queue.get()
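
A small illustration of how `queue_generator` drains a queue, assuming the node output behaves like a standard `queue.Queue` (an assumption; the fluteline internals are not shown here). Checking `empty()` is only reliable once no producer is still writing, which is why the helper is called after the nodes have been stopped.

```python
import queue

def queue_generator(q):
    # Same logic as the helper above: yield items until the queue is empty.
    while not q.empty():
        yield q.get()

q = queue.Queue()
for word in ['we', 'present', 'an']:
    q.put(word)

drained = list(queue_generator(q))
print(drained)
print(q.empty())
```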

In [4]:
def call_watson(credentials, settings, audio_filepath):
    apikey, hostname = watson_streaming.utilities.config(credentials)

    nodes = [
        watson_streaming.utilities.FileAudioGen(audio_filepath),
        watson_streaming.Transcriber(settings, apikey, hostname),
        deep_disfluency.asr.ibm_watson.IBMWatsonAdapter(),
    ]

    fluteline.connect(nodes)
    fluteline.start(nodes)
    time.sleep(80)  # The audio file is shorter than this
    fluteline.stop(nodes)
    
    return list(queue_generator(nodes[-1].output))

In [5]:
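# os.listdir returns entries in arbitrary order; sort so the sequence order is reproducible
audio_filepaths = [os.path.join(AUDIO_DIR, a) for a in sorted(os.listdir(AUDIO_DIR))]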

In [6]:
f = functools.partial(call_watson, CREDENTIALS_PATH, WATSON_SETTINGS)
with concurrent.futures.ThreadPoolExecutor() as executor:
    transcriptions = list(executor.map(f, audio_filepaths))
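
`executor.map` returns results in the order of its inputs, not in order of completion, so `transcriptions[i]` belongs to `audio_filepaths[i]`. The sketch below demonstrates this with a hypothetical stand-in for `call_watson` (`fake_transcribe` and its `delays` setting are made up for illustration) in which the first file finishes last.

```python
import concurrent.futures
import functools
import time

def fake_transcribe(settings, path):
    # Hypothetical stand-in for call_watson: sleep, then return a label.
    time.sleep(settings['delays'][path])
    return 'transcript of ' + path

paths = ['headset_mic.wav', 'radio_mic.wav']
settings = {'delays': {'headset_mic.wav': 0.2, 'radio_mic.wav': 0.0}}

f = functools.partial(fake_transcribe, settings)
with concurrent.futures.ThreadPoolExecutor() as executor:
    results = list(executor.map(f, paths))

# The slower headset job still comes first, matching the input order.
print(results)
```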

In [7]:
def dedup_words(iterable):
    known_ids = set()
    for item in iterable:
        id_ = item['id']
        if id_ not in known_ids:
            known_ids.add(id_)
            yield item['word']

In [8]:
first_appearances = [list(dedup_words(t)) for t in transcriptions]
last_appearances = [list(dedup_words(t[::-1]))[::-1] for t in transcriptions]
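
To make the first/last distinction concrete, here is a toy run on made-up interim results. The `id` and `word` keys follow the code above; the values are invented. Reversing the stream before deduplication keeps the final hypothesis for each word ID, and reversing again restores chronological order.

```python
def dedup_words(iterable):
    # Same helper as above: yield the word of the first item seen for each ID.
    known_ids = set()
    for item in iterable:
        id_ = item['id']
        if id_ not in known_ids:
            known_ids.add(id_)
            yield item['word']

# Invented stream: Watson revises word 0 and word 1 after first emitting them.
stream = [
    {'id': 0, 'word': 'pre'},
    {'id': 1, 'word': 'any'},
    {'id': 0, 'word': 'present'},
    {'id': 1, 'word': 'an'},
]

first = list(dedup_words(stream))
last = list(dedup_words(stream[::-1]))[::-1]
print(first)
print(last)
```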

# Diff

## Code meaning

(copied from [Python's difflib docs](https://docs.python.org/3.7/library/difflib.html#difflib.Differ))

- `-` line unique to sequence 1
- `+` line unique to sequence 2
- (no prefix) line common to both sequences; the two-space code is stripped by `diff` below

In [9]:
def diff(a, b):
    for line in difflib.ndiff(a, b):
        line = line.strip()
        if line and not line.startswith('? '):
            yield line
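
A short example of what `diff` yields for two word lists. The input words are chosen for illustration. `ndiff` emits extra `? ` hint lines when two items are similar ("points" vs "point"), and the helper filters those out.

```python
import difflib

def diff(a, b):
    # Same helper as above: strip each ndiff line and drop '? ' hint lines.
    for line in difflib.ndiff(a, b):
        line = line.strip()
        if line and not line.startswith('? '):
            yield line

a = ['score', 'points', 'in', 'ways']
b = ['score', 'point', 'in', 'two', 'ways']

lines = list(diff(a, b))
for line in lines:
    print(line)
```

Words shared by both lists are printed without a prefix, "points" is marked `-`, and "point" and "two" are marked `+`.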

In [10]:
print('Sequence order:', audio_filepaths)

Sequence order: ['edited/headset_mic.wav', 'edited/radio_mic.wav']


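This is the diff of the first appearance of each word. Words have IDs, so any later occurrence of the same ID is ignored. Following the sequence order printed above, `-` words come from the headset mic and `+` words from the radio mic.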

In [11]:
for line in diff(*first_appearances):
    print(line)

+ with
+ present
+ man
- we
- pre
- any
- mass
more
- purpose
- gay
- give
- full
+ person
+ get
+ do
+ for
test
- mother
- from
- on
- the
- if
- you
+ model
+ of
+ until
+ billable
+ be
+ and
Congress
- a
+ say
in
in
of
- village
+ virtual
from
using
- of
+ I've
a
- that
+ thought
the
- tree
+ the
+ dream
+ driven
by
do
you
- their
+ the
real
time
- Hey
+ head
been
- and
+ panned
move
- my
- ever
+ uh
+ a
the
press
of
a
button
each
but
the
is
+ really
+ my
can
be
- sub
+ self
by
- for
+ Faye
+ I
+ a
+ time
of
- a
- Tom
- move
- my
+ month
+ J.
+ by
+ I
agree
- Jeff
the
- object
+ uh
of
the
game
is
is
to
- school
+ call
- points
+ point
in
- a
+ way
ways
a
may
- faking
+ face
without
being
the
- N.
- B.
- made
+ an
+ be
+ by
the
tech
I
- stuff
- fake
- you
- SMA
- a
- neighbor
+ self
+ faking
+ the
+ and
+ they
+ amount
+ of
+ be
what
- the
- mall
- be
I'm
to
an
non
verbal
touring
- S.
+ test
in
- which
+ reach
the
effect
if
of
the
I
agree
control
- and
- number
- can
+ in
+ no
+ will

This is the diff of the last appearance of each word. These final hypotheses are more stable, so agreement between the mics is higher.

In [12]:
for line in diff(*last_appearances):
    print(line)

- we
- present
+ represent
any
massive
multi
person
game
developed
full
testing
models
of
non
verbal
behavior
in
conversation
people
interact
in
a
virtual
environment
using
+ I
with
us
that
the
+ uh
driven
by
default
by
the
real
time
head
and
hand
movement
however
on
the
press
of
a
button
each
- participant
- read
+ but
+ this
+ depends
+ really
movement
can
be
substituted
by
fake
of
- atonement
+ a
+ time
+ of
+ month
generated
by
+ algorithms
+ Jeff
- I
- agree
- get
the
object
of
the
game
is
is
to
score
points
in
two
ways
a
may
faking
without
being
detected
- N.
- B.
+ and
+ being
by
detecting
- one
+ when
- other
+ others
+ are
+ faking
+ this
+ enables
- stuff
- thank
- you
- esta
- neighbors
what
amount
to
be
what
- amount
+ amounts
to
a
non
verbal
Turing
test
in
- which
+ reach
the
effect
effectiveness
of
different
algorithms
for
controlling
nonverbal
behavior
can
be
directly
tested
and
evaluated
in
life
- tracks
+ in
- interaction
+ traction


# Edit distance

Here, edit distance means the number of `+` and `-` lines in a diff. A substituted word therefore counts twice: once as `-` and once as `+`.

In [13]:
def edit_distance(a, b):
    # Count the diff lines marked as unique to either sequence
    return sum(1 for x in diff(a, b) if x.startswith(('+', '-')))
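
For comparison, a conventional word-level Levenshtein distance counts a substitution once and, divided by the reference length, gives the usual word error rate. This is not what the notebook computes; it is a minimal sketch of an alternative, with `word_levenshtein` as a hypothetical helper name.

```python
def word_levenshtein(reference, hypothesis):
    """Minimum number of word insertions, deletions and substitutions."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        current = [i]
        for j, hyp_word in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]

reference = 'score points in two ways'.split()
hypothesis = 'score point in ways'.split()

distance = word_levenshtein(reference, hypothesis)
word_error_rate = distance / len(reference)
print(distance, word_error_rate)
```

Here one substitution ("points" to "point") and one deletion ("two") give a distance of 2, where the `+`/`-` line count would give 3.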

In [14]:
edit_distance(*first_appearances)

120

In [15]:
edit_distance(first_appearances[0], TARGET)

173

In [16]:
edit_distance(first_appearances[1], TARGET)

163

In [17]:
edit_distance(*last_appearances)

46

In [18]:
edit_distance(last_appearances[0], TARGET)

83

In [19]:
edit_distance(last_appearances[1], TARGET)

77

# Word count

In [21]:
len(TARGET)

111

In [23]:
len(first_appearances[0])

126

In [25]:
len(first_appearances[1])

130

In [29]:
for first, last in zip(first_appearances, last_appearances):
    assert len(first) == len(last)
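
To compare the distances against the length of the text, the counts reported above can be divided by `len(TARGET)`. The sketch below only reuses numbers already printed in this notebook. Because a substitution counts twice, these ratios can exceed 1 and should not be read as word error rates.

```python
target_words = 111  # len(TARGET), as reported above

# Diff-based distances to TARGET, as reported above (index 0 = headset, 1 = radio).
reported = {
    ('first', 'headset'): 173,
    ('first', 'radio'): 163,
    ('last', 'headset'): 83,
    ('last', 'radio'): 77,
}

ratios = {key: round(value / target_words, 2) for key, value in reported.items()}
for (appearance, mic), ratio in ratios.items():
    print(appearance, mic, ratio)
```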