# Checking IBM Watson ASR differences between the Vive headset microphone and a radio mic

Here's the original text:

*We present an immersive multi-person game developed for testing models of non-verbal behaviour in conversation. People interact in a virtual environment using avatars that are driven, by default, by their real-time head and hand movements. However, on the press of a button each participant's real movements can be substituted by "fake" avatar movements generated by algorithms. The object of the game is to score points in two ways a) by faking without being detected and b) by detecting when others are faking. This enables what amounts to a non-verbal Turing test in which the effectiveness of different algorithms for controlling non-verbal behaviour can be directly tested and evaluated in live interaction.*

## Audio files:

`edited/headset_mic.wav`

<audio controls src="edited/headset_mic.wav"></audio>

`edited/radio_mic.wav`

<audio controls src="edited/radio_mic.wav"></audio>

In [1]:
import concurrent.futures
import difflib
import functools
import os
import time

import deep_disfluency.asr.ibm_watson
import fluteline
import watson_streaming
import watson_streaming.utilities

In [2]:
CREDENTIALS_PATH = 'credentials.json'
WATSON_SETTINGS = {  # Copied from deep_disfluency_server
    'inactivity_timeout': -1,  # Don't kill me after 30 seconds
    'interim_results': True,
    'timestamps': True
}
AUDIO_DIR = 'emily_edited'
TARGET = '''
We present an immersive multi-person game developed for testing models
of non-verbal behaviour in conversation. People interact in a virtual
environment using avatars that are driven, by default, by their real-time
head and hand movements. However, on the press of a button each
participant's real movements can be substituted by "fake" avatar
movements generated by algorithms. The object of the game is to score
points in two ways a) by faking without being detected and b) by
detecting when others are faking. This enables what amounts to a
non-verbal Turing test in which the effectiveness of different algorithms
for controlling non-verbal behaviour can be directly tested and evaluated
in live interaction.
'''
TARGET = [''.join([c for c in word if c.isalnum()]) for word in TARGET.split()]
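
One thing worth keeping in mind when reading the distances to `TARGET` further down: the comprehension above only drops non-alphanumeric characters, so hyphenated words are fused into a single token, whereas the transcripts contain `multi`, `non` and `verbal` as separate words. The sketch below shows the effect on a short phrase. `tokenize_split_hyphens` is a hypothetical alternative for illustration only; the notebook does not use it and none of the recorded outputs were produced with it.

```python
def tokenize_as_in_notebook(text):
    # Same logic as the TARGET comprehension: drop every non-alphanumeric character.
    return [''.join(c for c in word if c.isalnum()) for word in text.split()]


def tokenize_split_hyphens(text):
    # Hypothetical variant (an assumption, not the notebook's code):
    # turn hyphens into spaces first so "non-verbal" yields two tokens.
    return tokenize_as_in_notebook(text.replace('-', ' '))


sample = 'a non-verbal Turing test'
print(tokenize_as_in_notebook(sample))  # ['a', 'nonverbal', 'Turing', 'test']
print(tokenize_split_hyphens(sample))   # ['a', 'non', 'verbal', 'Turing', 'test']
```

Neither version changes case or spelling, so differences such as "We" versus "we" or "behaviour" versus "behavior" would still count as mismatches.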

In [3]:
def queue_generator(queue):
    while not queue.empty():
        yield queue.get()

In [4]:
def call_watson(credentials, settings, audio_filepath):
    nodes = [
        watson_streaming.utilities.FileAudioGen(audio_filepath),
        watson_streaming.Transcriber(settings, credentials),
        deep_disfluency.asr.ibm_watson.IBMWatsonAdapter(),
    ]

    fluteline.connect(nodes)
    fluteline.start(nodes)
    time.sleep(80)  # The audio file is shorter than this
    fluteline.stop(nodes)
    
    return list(queue_generator(nodes[-1].output))

In [5]:
# os.listdir returns entries in arbitrary order; sort so the sequence order is reproducible
audio_filepaths = sorted(os.path.join(AUDIO_DIR, a) for a in os.listdir(AUDIO_DIR))

In [6]:
f = functools.partial(call_watson, CREDENTIALS_PATH, WATSON_SETTINGS)
with concurrent.futures.ThreadPoolExecutor() as executor:
    transcriptions = list(executor.map(f, audio_filepaths))
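
The rest of the notebook relies on `transcriptions` being in the same order as `audio_filepaths`. `executor.map` yields results in input order regardless of which call finishes first, which the following minimal sketch illustrates. `fake_transcribe` is a hypothetical stand-in for `call_watson`, and the file names and delay are made up for the demonstration.

```python
import concurrent.futures
import functools
import time


def fake_transcribe(credentials, settings, audio_filepath):
    # Hypothetical stand-in for call_watson: the first file is made slower
    # on purpose so that it finishes last.
    time.sleep(0.2 if 'headset' in audio_filepath else 0.0)
    return 'transcript of ' + audio_filepath


paths = ['headset_mic.wav', 'radio_mic.wav']
f = functools.partial(fake_transcribe, 'credentials.json', {})
with concurrent.futures.ThreadPoolExecutor() as executor:
    results = list(executor.map(f, paths))

print(results)
# ['transcript of headset_mic.wav', 'transcript of radio_mic.wav']
```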

In [7]:
def dedup_words(iterable):
    known_ids = set()
    for item in iterable:
        id_ = item['id']
        if id_ not in known_ids:
            known_ids.add(id_)
            yield item['word']

In [8]:
first_appearances = [list(dedup_words(t)) for t in transcriptions]
last_appearances = [list(dedup_words(t[::-1]))[::-1] for t in transcriptions]
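
To make the difference between the two lists concrete, here is a small sketch on a made-up stream. It assumes, as `dedup_words` does, that each item is a dict with `'id'` and `'word'` keys, and that a revised hypothesis for a word reuses the same ID. The stream contents are invented for illustration.

```python
def dedup_words(iterable):
    known_ids = set()
    for item in iterable:
        id_ = item['id']
        if id_ not in known_ids:
            known_ids.add(id_)
            yield item['word']


# Hypothetical stream: word 1 is first heard as "price", then revised to "press".
stream = [
    {'id': 0, 'word': 'the'},
    {'id': 1, 'word': 'price'},
    {'id': 1, 'word': 'press'},
    {'id': 2, 'word': 'of'},
]

first = list(dedup_words(stream))              # keep the earliest word per ID
last = list(dedup_words(stream[::-1]))[::-1]   # keep the latest word per ID

print(first)  # ['the', 'price', 'of']
print(last)   # ['the', 'press', 'of']
```

Reversing the stream before deduplicating keeps the final version of each ID, and reversing the result again restores the spoken order.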

# Diff

## Code meaning

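(copied from [Python's difflib docs](https://docs.python.org/3.7/library/difflib.html#difflib.Differ))

- `-` line unique to sequence 1
- `+` line unique to sequence 2

Lines without a marker are words common to both sequences. The `diff` helper strips the leading spaces that difflib puts in front of them and drops difflib's `?` hint lines.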

In [9]:
def diff(a, b):
    for line in difflib.ndiff(a, b):
        line = line.strip()
        if line and not line.startswith('? '):
            yield line
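
As a quick illustration of what this helper yields, the sketch below runs it on two short, made-up word lists that differ in one word.

```python
import difflib


def diff(a, b):
    for line in difflib.ndiff(a, b):
        line = line.strip()
        if line and not line.startswith('? '):
            yield line


# Made-up sequences that differ in a single word
seq1 = ['on', 'the', 'price', 'of']
seq2 = ['on', 'the', 'press', 'of']

result = list(diff(seq1, seq2))
print(result)
# ['on', 'the', '- price', '+ press', 'of']
```

A substituted word therefore shows up as a `-` line followed by a `+` line, while shared words appear bare.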

In [10]:
print('Sequence order:', audio_filepaths)

Sequence order: ['emily_edited/headset_mic.wav', 'emily_edited/radio_mic.wav']


This is the diff of the first appearance of each word. Each word has an ID, so only the first occurrence of an ID is kept and any later occurrence is ignored.

In [11]:
for line in diff(*first_appearances):
    print(line)

- is
- an
+ we
+ present
a
- among
+ a
massive
multi
- passing
+ person
game
- to
+ of
for
- testing
- are
+ teh
+ motor
+ from
non
+ vote
but
+ income
- but
- in
a
- people
+ uh
into
- in
- about
+ of
+ a
two
and
by
+ he's
+ avatar
+ the
+ drew
+ by
+ to
- using
- Apatow
- that
- try
- and
- I
- just
but
there
- field
+ built
time
had
moved
and
- Han
- mints
- however
+ Honda
+ moved
+ how
on
the
- price
+ press
of
a
button
each
- about
+ by
- his
+ is
- meet
+ move
can
be
so
by
- a
- on
- the
+ faith
+ of
+ it
Tom
+ is
- movements
- Jenna
- by
I
- great
- open
- of
+ go
+ over
+ it
the
game
is
- just
+ to
score
points
+ into
- and
- to
a
+ ways
a
by
faith
- without
+ with
being
- too
+ to
- he
+ tech
+ by
just
went
all
this
I
- it
- name
- to
+ this
+ in
+ a
amounts
to
be
in
- New
- bubble
+ on
+ the
to
test
- in
+ we
which
- he
+ the
effect
of
- did
+ the
I'll
go
controlling
non
- verbal
- behavior
+ vibe
+ but
can
be
- to
- to
+ directed
+ tied
in
about
value
I
and
rock


This is the diff of the last appearance of each word. The last hypothesis for a word is more stable than the first one, so the two mics agree more closely here.

In [12]:
for line in diff(*last_appearances):
    print(line)

- be
+ we
present
and
a
massive
multi
passing
game
developed
+ for
- protesting
+ testing
models
of
non
verbal
behavior
in
conversation
- people
+ uh
interact
in
a
virtual
environment
by
using
avatars
that
+ driven
- trend
- and
by
default
- by
+ but
- their
+ the
real
time
had
movements
and
- handmade
- mints
+ hand
+ movements
however
on
the
press
of
a
button
each
participants
will
movements
can
be
substituted
by
+ fake
+ avatar
- thank
- on
- the
- Tom
movements
generated
by
algorithms
the
object
of
the
game
is
to
score
points
in
two
ways
eight
by
faking
without
being
detected
- de
+ beat
+ by
dissecting
when
others
a
faking
- isn't
- able
- to
+ this
+ enables
+ what
amounts
to
be
a
non
- bubble
- cheering
+ verbal
+ Turing
test
in
which
the
effectiveness
of
different
algorithms
for
controlling
non
verbal
behavior
can
be
directly
tested
and
evaluated
- and
- light
+ in
+ life
rock
interaction


# Edit distance

The number of `+` and `-` lines in a diff. A substituted word produces one `-` line and one `+` line, so it counts twice.

In [13]:
def edit_distance(a, b):
    # Count only the lines marked as unique to one of the two sequences
    return sum(1 for line in diff(a, b) if line.startswith(('+', '-')))
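
The sketch below checks this measure on tiny made-up inputs. It restates `diff` so that it runs on its own, and it shows that the count is not a classic Levenshtein distance: one substituted word contributes 2.

```python
import difflib


def diff(a, b):
    for line in difflib.ndiff(a, b):
        line = line.strip()
        if line and not line.startswith('? '):
            yield line


def edit_distance(a, b):
    # Count the '+' and '-' lines of the diff
    return sum(1 for line in diff(a, b) if line.startswith(('+', '-')))


# Made-up hypothesis and reference
hyp = ['on', 'the', 'price', 'of', 'a', 'button']
ref = ['on', 'the', 'press', 'of', 'a', 'button']

substitution = edit_distance(hyp, ref)      # "price" -> "press"
deletion = edit_distance(hyp, hyp[:-1])     # "button" dropped
identical = edit_distance(hyp, hyp)

print(substitution, deletion, identical)  # 2 1 0
```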

In [14]:
edit_distance(*first_appearances)

107

In [15]:
edit_distance(first_appearances[0], TARGET)

161

In [16]:
edit_distance(first_appearances[1], TARGET)

160

In [17]:
edit_distance(*last_appearances)

41

In [18]:
edit_distance(last_appearances[0], TARGET)

79

In [19]:
edit_distance(last_appearances[1], TARGET)

58

# Word count

In [20]:
len(TARGET)

111

In [21]:
len(first_appearances[0])

120

In [22]:
len(first_appearances[1])

119

In [23]:
for first, last in zip(first_appearances, last_appearances):
    assert len(first) == len(last)