# Checking IBM Watson ASR differences between the Vive headset microphone and a radio mic

Here's the original text spoken in the recordings:

*We present an immersive multi-person game developed for testing models of non-verbal behaviour in conversation. People interact in a virtual environment using avatars that are driven, by default, by their real-time head and hand movements. However, on the press of a button each participant's real movements can be substituted by "fake" avatar movements generated by algorithms. The object of the game is to score points in two ways a) by faking without being detected and b) by detecting when others are faking. This enables what amounts to a non-verbal Turing test in which the effectiveness of different algorithms for controlling non-verbal behaviour can be directly tested and evaluated in live interaction.*

## Audio files:

`edited/headset_mic.wav`

<audio controls src="edited/headset_mic.wav"></audio>

`edited/radio_mic.wav`

<audio controls src="edited/radio_mic.wav"></audio>

In [1]:
import concurrent.futures
import difflib
import functools
import os
import time

import deep_disfluency.asr.ibm_watson
import fluteline
import watson_streaming
import watson_streaming.utilities

In [2]:
CREDENTIALS_PATH = 'credentials.json'
WATSON_SETTINGS = {  # Copied from deep_disfluency_server
    'inactivity_timeout': -1,  # Don't kill me after 30 seconds
    'interim_results': True,
    'timestamps': True
}
AUDIO_DIR = 'edited'

In [3]:
def queue_generator(queue):
    """Yield items from the queue until it is empty."""
    while not queue.empty():
        yield queue.get()

In [4]:
def call_watson(credentials, settings, audio_filepath):
    apikey, hostname = watson_streaming.utilities.config(credentials)

    nodes = [
        watson_streaming.utilities.FileAudioGen(audio_filepath),
        watson_streaming.Transcriber(settings, apikey, hostname),
        deep_disfluency.asr.ibm_watson.IBMWatsonAdapter(),
    ]

    fluteline.connect(nodes)
    fluteline.start(nodes)
    time.sleep(70)  # The audio file is shorter than this
    fluteline.stop(nodes)
    
    return list(queue_generator(nodes[-1].output))

In [5]:
# Sorted, because os.listdir returns entries in arbitrary order
audio_filepaths = sorted(os.path.join(AUDIO_DIR, a) for a in os.listdir(AUDIO_DIR))

In [6]:
f = functools.partial(call_watson, CREDENTIALS_PATH, WATSON_SETTINGS)
with concurrent.futures.ThreadPoolExecutor() as executor:
    transcriptions = list(executor.map(f, audio_filepaths))

In [7]:
def dedup_words(iterable):
    """Yield the word of each item whose ID has not been seen before."""
    known_ids = set()
    for item in iterable:
        id_ = item['id']
        if id_ not in known_ids:
            known_ids.add(id_)
            yield item['word']

In [8]:
first_appearances = [list(dedup_words(t)) for t in transcriptions]
last_appearances = [list(dedup_words(t[::-1]))[::-1] for t in transcriptions]

In [9]:
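def print_diff(a, b):
    """Print an ndiff of two word lists, skipping the '? ' hint lines."""
    for line in difflib.ndiff(a, b):
        line = line.strip()  # Also drops the two-space prefix of common lines
        if line and not line.startswith('? '):
            print(line)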

## Code meaning

(copied from [Python's difflib docs](https://docs.python.org/3.7/library/difflib.html#difflib.Differ))

- `-` line unique to sequence 1
- `+` line unique to sequence 2
- no prefix: line common to both sequences (`print_diff` strips the leading spaces)
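As a small illustration of these codes, the sketch below runs `difflib.ndiff` on two short, made-up word lists (not taken from the recordings) and filters out the `? ` hint lines in the same way as `print_diff`. The helper name `diff_lines` is hypothetical.

```python
import difflib


def diff_lines(a, b):
    """Return ndiff lines for two word lists, without the '? ' hint lines."""
    lines = (line.strip() for line in difflib.ndiff(a, b))
    return [line for line in lines if line and not line.startswith('? ')]


# Made-up word lists, for illustration only
seq1 = ['we', 'present', 'a', 'game']
seq2 = ['we', 'represent', 'a', 'game']

for line in diff_lines(seq1, seq2):
    print(line)
```

Words shared by both lists print without a prefix, while `present` is marked with `-` and `represent` with `+`.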

In [10]:
print('Sequence order:', audio_filepaths)

Sequence order: ['edited/headset_mic.wav', 'edited/radio_mic.wav']


This is the diff of the first appearance of each word. Each word has an ID, so any later occurrence of the same ID is ignored.
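To make the difference between first and last appearances concrete, here is a minimal sketch on a made-up stream of results. It assumes only that each item is a dict with `id` and `word` keys, as used by `dedup_words`; the toy data and the idea that a later item with the same ID is a revised hypothesis are assumptions for illustration.

```python
def dedup_words(iterable):
    """Yield the word of each item whose ID has not been seen before."""
    known_ids = set()
    for item in iterable:
        if item['id'] not in known_ids:
            known_ids.add(item['id'])
            yield item['word']


# Hypothetical stream: IDs 1 and 2 are each reported twice
stream = [
    {'id': 0, 'word': 'we'},
    {'id': 1, 'word': 'pre'},
    {'id': 1, 'word': 'present'},
    {'id': 2, 'word': 'any'},
    {'id': 2, 'word': 'an'},
]

# Keep the first word seen for each ID
first = list(dedup_words(stream))
# Walk the stream backwards to keep the last word for each ID,
# then restore the original order
last = list(dedup_words(stream[::-1]))[::-1]

print(first)
print(last)
```

Reversing the stream before deduplicating keeps the final word reported for each ID, and reversing the result again puts the words back in spoken order.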

In [11]:
print_diff(*first_appearances)

- we
+ with
pre
+ mouse
- any
- mass
multi
purpose
- and
- give
+ game
+ do
for
- testing
- more
- for
+ test
+ among
+ from
on
the
- if
- you
+ be
+ in
Congo
- a
+ say
enter
in
a
village
from
+ to
use
+ with
- of
- a
dust
+ that
the
+ uh
driven
by
do
you
- their
+ the
real
time
had
- to
- and
+ the
+ panned
move
my
- ever
+ on
the
press
of
a
button
it
but
this
is
+ really
+ movement
can
be
- sub
+ self
by
- fake
+ for
of
a
- town
- move
+ Tom
+ of
my
+ Jenna
+ by
+ a
agree
- Joe
+ of
the
- object
- of
the
game
is
is
to
call
- on
+ part
in
two
ways
a
- may
+ brief
faking
with
a
the
and
being
by
the
- tech
- other
+ when
+ I
stuff
fake
+ this
+ and
+ for
- you
- SM
- a
- neighbor
- saw
among
to
be
- amount
+ what
+ I
to
a
non
vote
to
- S.
+ test
in
we
- the
+ if
effect
affect
- even
+ in
the
- agreed
- in
+ I'll
+ go
control
in
verbal
- can
+ will
be
- doing
+ dry
test
and
if
I
life
and
track


This is the diff of the last appearance of each word. The last appearance is the more "stable" hypothesis, so agreement between the mics is higher.

In [12]:
print_diff(*last_appearances)

- we
- present
+ represent
any
massive
multi
person
game
developed
full
testing
models
of
non
verbal
behavior
in
conversation
people
interact
in
a
virtual
environment
using
+ I
with
us
that
the
+ uh
driven
by
default
by
the
real
time
head
and
hand
movement
however
on
the
press
of
a
button
each
- participant
- read
+ but
+ this
+ depends
+ really
movement
can
be
substituted
by
fake
of
- atonement
+ a
+ time
+ of
+ month
generated
by
+ algorithms
- I
- agree
- yeah
the
object
of
the
game
is
is
to
score
points
in
two
ways
a
may
faking
without
being
detected
- N.
- B.
+ and
+ being
by
detecting
- one
+ when
- other
+ others
+ are
+ faking
+ this
+ enables
- stuff
- thank
- you
- esta
- neighbors
what
amount
to
be
what
- amount
+ amounts
to
a
non
verbal
Turing
test
in
- which
+ reach
the
effect
effectiveness
of
different
algorithms
for
controlling
nonverbal
behavior
can
be
directly
tested
and
evaluated
in
life
- track
+ in
- interaction
+ traction
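The impression that last appearances agree more closely than first appearances could also be put into a number. The sketch below is one possible approach, not part of the original analysis: it uses `difflib.SequenceMatcher.ratio()` as a rough similarity score between two word lists. The helper name `agreement` and the toy word lists are hypothetical.

```python
import difflib


def agreement(a, b):
    """Return a similarity score in [0, 1] for two word lists."""
    return difflib.SequenceMatcher(None, a, b).ratio()


# Made-up word lists, for illustration only
headset = ['the', 'object', 'of', 'the', 'game']
radio = ['the', 'object', 'of', 'a', 'game']

print(agreement(headset, radio))
```

With the notebook's variables, `agreement(*first_appearances)` and `agreement(*last_appearances)` would give one score per diff, which makes the two easier to compare than reading the output line by line.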
