In [3]:
# Question: how well does a local Whisper model predict phrase finality (i.e. terminal ".", "?", "!")?

# Strategy
# I will say a mixture of questions, statements, and phrases with various intonations, and record the words said, the intonation used,
# and the raw transcription, paying close attention to the punctuation added.

# Results:

# QUESTION PROMPTS
# 1. What is the weather outside like today? 
#   - Intonation: Falling - mid mid low high mid mid low
#   - Result: 
#       1. What is the weather outside?
#       2. What is the weather outside like to
#       3. What is the weather outside?
#       4. What is the weather outside like?
#       5. What is the weather outside like today?
# 2. How are you?
#   - Intonation: Falling - mid mid low
#   - Result:
#       1. [No Words Detected]
#       2. How are you?
#       3. How are you?
#       4. How are you?
#       5. How are you?
# 3. What is your favorite color?
#   - Intonation: Falling - mid mid low high mid
#   - Result:
#       1. What is your favorite color?
#       2. What is your favorite color?
#       3. What is your favorite color?
#       4. What is your favorite color?
#       5. What is your favorite color?
# 4. And you also like to surf?
#   - Intonation: Rising - low mid high high mid high
#   - Result:
#       1. And you also like to surf?
#       2. and you also like to surf?
#       3. And you also like to serve?
#       4. And you also like to surf?
#       5. And you also like to surf?
# 5. What are your plans this wednesday?
#   - Intonation: Neutral - mid low high mid mid mid
#   - Result:
#       1. What are your plans this Wednesday?
#       2. your plans this Wednesday.
#       3. What are your plans this Wednesday?
#       4. What are your plans this Wednesday?
#       5. your plans this way.

# STATEMENT PROMPTS
# 1. I like big butts and I cannot lie.
#   - Intonation: Falling - mid mid mid highs low low mid low
#   - Result:
#       1. I like big buds.
#       2. I like big buds and
#       3. I like big butts and I like...
#       4. I like big butts and I cannot lie.
#       5. I like big butts and I...
# 2. And yet, knowledge is obviously power.
#   - Intonation: Falling - mid high, high mid mid low
#   - Result:
#       1. And yet, knowledge is obviously power.
#       2. The idea, knowledge is...
#       3. And yet, knowledge is obviously power.
#       4. And yet, knowledge is obviously power.
#       5. And yet knowledge is obviously power.
# 3. I flew.
#   - Intonation: Falling - high low
#   - Result:
#       1. I flew.
#       2. I flew.
#       3. I flew.
#       4. I flew.
#       5. I flew.
# 4. The fitness gram pacer test is a cardiovascular test that gradually picks up pace as the test progresses.
#   - Intonation: Falling - mid high high high low mid low high low mid mid high mid mid low low mid low
#   - Result:
#       1. The fitness gram pacertest is a cardiovascular test that gradually picks up.
#       2. The fitness grand pacer test.
#       3. The Fitness Grand Pacer Test is a cardiovascular test that gradually picks up pace as the test progresses.
#       4. The fitness gram PASER test is a cardiovascular
#       5. The Fitness Grim Pacer Test is a cardiovascular test that gradually picks up pace as the test progresses.
# 5. When the mood is just right, on windy days, I would rather sit inside and play minecraft, compared to my other option, which is, taking the
#       dogs for a walk.
#   - Intonation: Falling - mid low high low high mid, low high mid, mid low mid high mid low high mid, low low mid high mid, high high, mid mid 
#        mid mid mid low.
#   - Result:
#       1. When the mood is just right, on windy days, I would rather sit inside.
#       2. When the mood is just right on windy days, I would rather sit inside.
#       3. when the mood is just right on windy days.
#       4. When the mood is just right on windy days
#       5. When the mood is just right on windy days, I would rather sit inside

# PHRASE PROMPTS (incomplete sentences)
# 1. The muffin man keeps his
#   - Terminal Intonation: Flat
#   - Result:
#       1. The Muffin Man keeps his.
#       2. The Muffin man keeps his.
#       3. The Muffin man keeps his...
#       4. The Mothelman keeps his.
#       5. The muffin man keeps his
# 2. I don't know about you, but
#   - Intonation: neutral
#   - Result:
#       1. I don't know about you, but...
#       2. I don't know about you, but...
#       3. I don't know about you.
#       4. I don't know about you.
#       5. I don't know about you.
# 3. I've got this weird
#   - Intonation: Rising
#   - Result:
#       1. I've got this weird
#       2. I've got this weird
#       3. I've got this weird
#       4. this weird
#       5. I've got this weird
# 4. In the morning
#   - Intonation: Falling
#   - Result:
#       1. in the morning.
#       2. in the morning.
#       3. in the morning.
#       4. in the morning.
#       5. in the morning.
#   - Intonation: Rising
#       1. In the morning.
#       2. in the morning
#       3. in the morning.
#       4. in the morning
#       5. in the morning.
# 5. The answer to that question is
#   - Intonation: Rising
#   - Result:
#       1. the answer to that question is...
#       2. The answer to that question is
#       3. The answer to that question is
#       4. The answer to that question is...
#       5. The answer to that question is

# Conclusion: 
# The "base" Whisper model errs heavily on the side of adding too much punctuation.
# ---------------------------------------------------------------------------------
#   Many of the phrases, which were not complete sentences, were transcribed with a terminating period. I exaggerated the intonation to imply,
#   to the best of my ability, that the phrase was not a complete thought and that more was coming. Even so, the input window closed quickly
#   and the phrase was transcribed as a sentence ending in a period.

#   I also tried different intonations (see Phrases -> "In the morning") to make sure I was using one that actually signals an incomplete
#   thought, and I still got similar results. Rising intonation did produce 2/5 correct responses in this case, but that is still poor,
#   and 5 trials is too small a sample to be statistically significant.
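
To put a number on how little 2/5 says, a 95% Wilson score interval can be computed for the rising-intonation trials. This is only a minimal sketch: the helper name `wilson_interval` is my own, and z = 1.96 is the usual normal-approximation value for 95% confidence.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95% confidence)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    z2 = z * z
    denom = 1 + z2 / trials
    center = (p + z2 / (2 * trials)) / denom
    half_width = z * math.sqrt(p * (1 - p) / trials + z2 / (4 * trials * trials)) / denom
    return center - half_width, center + half_width


# 2 of 5 rising-intonation trials came back without a terminating period
low, high = wilson_interval(2, 5)
print(f"2/5 correct -> 95% interval: {low:.2f} to {high:.2f}")
```

The interval runs from roughly 0.12 to 0.77, so five trials cannot distinguish a model that is almost always wrong from one that is usually right.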
# 
# 
# The model also tends to terminate voice input in the middle of longer sentences
# -------------------------------------------------------------------------------
#   With the sentence "When the mood is just right, on windy days, I would rather sit inside and play minecraft, compared to my other option, which is, taking the
#   dogs for a walk.", the model appeared to wait for a natural stopping point in the sentence and then stop listening. I even tried speeding up my
#   speech slightly around these "problem areas", but the input was still cut off early every time.

#   This did not only happen with long, run-on sentences with many clauses. It also happened with the sentence: "The fitness gram pacer test is a cardiovascular
#   test that gradually picks up pace as the test progresses." This sentence is simple and direct, yet the input window closed too early
#   in 3 of 5 trials. After the first miss, I made sure to speak slightly faster in the cutoff zone, and I still saw cutoffs afterwards.

# Follow-up Questions:
#   1. Does the "base.en" model perform better in the problematic areas mentioned above?
#   2. Does the "small" model perform better in the problematic areas mentioned above?
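
The punctuation tallies above were done by eye. A small helper could make them repeatable across models. This is a sketch under my own assumptions: the function name `terminal_mark` and the category labels are hypothetical, and the sample data is copied from the rising-intonation "In the morning" trials above.

```python
from collections import Counter


def terminal_mark(transcript: str) -> str:
    """Classify how a transcript ends: '...', '.', '?', '!', '-', or 'none'."""
    text = transcript.strip()
    if text.endswith(("...", "…")):
        return "..."
    if text and text[-1] in ".?!-":
        return text[-1]
    return "none"


# "In the morning" with rising intonation, "base" model (results copied from above)
rising_results = [
    "In the morning.",
    "in the morning",
    "in the morning.",
    "in the morning",
    "in the morning.",
]
tally = Counter(terminal_mark(t) for t in rising_results)
print(dict(tally))
```

Treating "..." as its own category matters here, because a trailing ellipsis signals an unfinished thought while a single period signals a finished one.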

In [4]:
import speech_recognition as sr

# Setup speech recognition
recognizer = sr.Recognizer()

# Function to capture speech and convert to text
def speech_to_text():
    with sr.Microphone() as source:
        print("Listening...")
        recognizer.adjust_for_ambient_noise(source)
        audio = recognizer.listen(source)
        
    try:
        text = recognizer.recognize_whisper(audio, model="base") # Works offline!
        print(f"You said: {text}")
        return text
    except sr.UnknownValueError:
        print("Sorry, I couldn't understand that.")
        return None
    except sr.RequestError:
        print("Speech service unavailable")
        return None

In [97]:
speech_to_text()

ALSA lib pcm_dsnoop.c:601:(snd_pcm_dsnoop_open) unable to open slave
ALSA lib pcm_dmix.c:1032:(snd_pcm_dmix_open) unable to open slave
ALSA lib pcm.c:2664:(snd_pcm_open_noupdate) Unknown PCM cards.pcm.rear
ALSA lib pcm.c:2664:(snd_pcm_open_noupdate) Unknown PCM cards.pcm.center_lfe
ALSA lib pcm.c:2664:(snd_pcm_open_noupdate) Unknown PCM cards.pcm.side
ALSA lib pcm_oss.c:397:(_snd_pcm_oss_open) Cannot open device /dev/dsp
ALSA lib pcm_oss.c:397:(_snd_pcm_oss_open) Cannot open device /dev/dsp
ALSA lib confmisc.c:160:(snd_config_get_card) Invalid field card
ALSA lib pcm_usb_stream.c:482:(_snd_pcm_usb_stream_open) Invalid card 'card'
ALSA lib confmisc.c:160:(snd_config_get_card) Invalid field card
ALSA lib pcm_usb_stream.c:482:(_snd_pcm_usb_stream_open) Invalid card 'card'
ALSA lib pcm_dmix.c:1032:(snd_pcm_dmix_open) unable to open slave
ALSA lib pcm_dsnoop.c:601:(snd_pcm_dsnoop_open) unable to open slave
ALSA lib pcm_dmix.c:1032:(snd_pcm_dmix_open) unable to open slave
ALSA lib pcm.c:2664

Listening...
You said:  The answer to that question is


' The answer to that question is'

In [None]:
# follow up questions due diligence:

# Problematic word sequences:
#    I like big butts and I cannot lie.
#    The fitness gram pacer test is a cardiovascular test that gradually picks up pace as the test progresses.
#    When the mood is just right, on windy days, I would rather sit inside and play minecraft, compared to my other option, which is, taking the
#          dogs for a walk.
#    In the morning
#    I don't know about you, but
with sr.Microphone() as source:
    print("Listening...")
    recognizer.adjust_for_ambient_noise(source)
    audio = recognizer.listen(source)
    
try:
    text = recognizer.recognize_whisper(audio, model="base.en") # Works offline!
    print(f"You said: {text}")
except sr.UnknownValueError:
    print("Sorry, I couldn't understand that.")
except sr.RequestError:
    print("Speech service unavailable")

Listening...
You said:  l can't know about you...


In [None]:
# "base.en" model results:

# I like big butts and I cannot lie.
#   1. I like big butts and I can't...
#   2. I like big butts and I
#   3. I like a big butt and I
#   4. I like big butts and I cannot lie.
#   5. I like big butts.

# The fitness gram pacer test is a cardiovascular test that gradually picks up pace as the test progresses.
#   1. The fitness gram pacer test is a cardiovascular test that gradually picks up pace as the test progresses.
#   2. The fitness gram pacer test is a cardiovascular test that gradually picks up pace as the test progresses.
#   3. The fitness gram pacer test is a cardiovascular
#   4. The fitness gram pacer test is a cardiovascular test that gradually picks up pace as the test progresses.
#   5. the fitness gram pacer test.

# When the mood is just right, on windy days, I would rather sit inside and play minecraft, compared to my other option, which is, taking the
#       dogs for a walk.
#   1. When they mood is just right on windy days, I would rather sit inside
#   2. When the mood is just right on windy days
#   3. When the mood is just right on windy days, I would rather sit inside
#   4. when the mood is just right on windy days.
#   5. When the mood is just right on windy days, I

# In the morning (rising intonation)
#   1. in the morning.
#   2. in the morning.
#   3. in the morning.
#   4. in the morning.
#   5. in the morning # NOTE: I drew this one out longer than the rest

# I don't know about you, but
#   1. I don't know about you, but...
#   2. I don't know about you, but...
#   3. I don't know about you, but...
#   4. I don't know about you, but...
#   5. l can't know about you...

# Conclusion:
#   There were two problematic behaviors that the "base" model produced:
#     1. The model was overly sensitive to input termination
#     2. The model was overly optimistic about adding periods.

#   The "base.en" model was better in both of these areas. It was also better at transcribing words such as "butts" and "fitness gram pacer",
#   which the "base" model struggled with. It did still struggle with "in the morning": the only time it showed any sensitivity to intonation
#   was when I exaggerated the hold on the word "morning", heavily implying that I had not finished my thought.
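
Early cutoffs can also be scored rather than eyeballed. The sketch below compares word counts between the prompt and each transcript. The names `words` and `coverage` are my own, and the measure is deliberately crude: a misrecognized word still counts as captured, so it tracks input-window length rather than transcription accuracy.

```python
import re


def words(text: str) -> list[str]:
    """Lowercase word tokens, keeping apostrophes so "can't" stays one word."""
    return re.findall(r"[a-z0-9']+", text.lower())


def coverage(prompt: str, transcript: str) -> float:
    """Fraction of the prompt's word count present in the transcript, capped at 1.0."""
    prompt_words = words(prompt)
    if not prompt_words:
        return 0.0
    return min(len(words(transcript)), len(prompt_words)) / len(prompt_words)


prompt = "I like big butts and I cannot lie."
# "base.en" results copied from above
base_en_results = [
    "I like big butts and I can't...",
    "I like big butts and I",
    "I like a big butt and I",
    "I like big butts and I cannot lie.",
    "I like big butts.",
]
for transcript in base_en_results:
    print(f"{coverage(prompt, transcript):.2f}  {transcript}")
```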

In [150]:
# Problematic word sequences:
#    I like big butts and I cannot lie.
#    The fitness gram pacer test is a cardiovascular test that gradually picks up pace as the test progresses.
#    When the mood is just right, on windy days, I would rather sit inside and play minecraft, compared to my other option, which is, taking the
#          dogs for a walk.
#    In the morning
#    I don't know about you, but
with sr.Microphone() as source:
    print("Listening...")
    recognizer.adjust_for_ambient_noise(source)
    audio = recognizer.listen(source)
    
try:
    text = recognizer.recognize_whisper(audio, model="small") # Works offline!
    print(f"You said: {text}")
except sr.UnknownValueError:
    print("Sorry, I couldn't understand that.")
except sr.RequestError:
    print("Speech service unavailable")

Listening...
You said:  I don't know about you.


In [None]:
# "small" model results:

# I like big butts and I cannot lie.
#   1. I like big butts and I...
#   2. I like big butts and I cannot lie.
#   3. I like big butts and I-
#   4. I like big butts and I...
#   5. I like big butts and I cannot lie.

# The fitness gram pacer test is a cardiovascular test that gradually picks up pace as the test progresses.
#   1. The fitness gram-pacer test is a cardiovascular test that gradually picks up pace as the test progresses.
#   2. The fitness gram-pacer test is a cardiovascular test that grows
#   3. The fitness gram pacer test is a cardiovascular test that gradually picks up pay.
#   4. The fitness gram pacer test is a cardiovascular test.
#   5. The Fitness Gram Pacer Test is a cardiovascular test.

# When the mood is just right, on windy days, I would rather sit inside and play minecraft, compared to my other option, which is, taking the
#       dogs for a walk.
#   1. When the mood is just right.
#   2. When the mood is just right on windy days, I would rather sit inside
#   3. When the mood is just right on windy days, I would rather sit inside
#   4. When the mood is just right on windy days, I would rather sit inside and
#   5. When the mood is just right on windy days, I would rather sit inside and play.

# In the morning (rising intonation)
#   1. in the morning
#   2. in the morning
#   3. in the morning
#   4. in the morning.
#   5. in the morning.

# I don't know about you, but
#   1. I don't know about you.
#   2. I don't know about you, but...
#   3. I don't know about you, but...
#   4. I don't know about you, but...
#   5. I don't know about you.

# Conclusion
#   There were two problematic behaviors that the "base" model produced:
#     1. The model was overly sensitive to input termination
#     2. The model was overly optimistic about adding periods.

#   The "small" model was slightly better at keeping the input window open and much better at avoiding spurious terminating periods.
#   Unsurprisingly, it was also better at transcribing the words that the "base" model struggled with, such as "butts" and "fitness gram pacer".
#   This was also the first time I saw the model end a transcription with a "-".

#   Although this model was still prone to early termination, especially in the "...I would rather sit inside..." trials, it showed a
#   much better grasp of phrasing context and intonation: even when the voice input was cut off, it recognized that the transcription was
#   NOT a complete thought and did not add a terminating period, which is very promising.

In [154]:
# Follow-up Questions:
#   Both "base.en" and "small", which in theory should improve on the "base" model, did show signs of improvement.
#   Does it follow that "small.en" will improve these results further?

#   The "small" model seemed to have a better grasp of intonation and phrase context but struggled more with closing the voice input window
#   early. Is there a configuration setting that raises the threshold for when to stop the input window?

In [184]:
# Follow up question code:

with sr.Microphone() as source:
    print("Listening...")
    recognizer.adjust_for_ambient_noise(source)
    audio = recognizer.listen(source)
    
try:
    text = recognizer.recognize_whisper(audio, model="base.en") # Works offline!
    print(f"You said: {text}")
except sr.UnknownValueError:
    print("Sorry, I couldn't understand that.")
except sr.RequestError:
    print("Speech service unavailable")


Listening...


KeyboardInterrupt: 

In [181]:
# "small.en" model results:

# I like big butts and I cannot lie.
#   1. I like big butts and I cannot lie.
#   2. I like big butts and I cannot lie.
#   3. I like big butts and I...
#   4. I like big butts and I cannot lie.
#   5. I like big butts and I cannot lie.

# The fitness gram pacer test is a cardiovascular test that gradually picks up pace as the test progresses.
#   1. The FitnessGram Pacer Test is a cardiovascular test.
#   2. The FitnessGram Pacer Test
#   3. The fitness grim pacer test
#   4. The fitness gram pacer test is a cardiovascular test
#   5. The fitness gram pacer test

# When the mood is just right, on windy days, I would rather sit inside and play minecraft, compared to my other option, which is, taking the
#       dogs for a walk.
#   1. When the mood is just right, on windy days, I would rather sit inside and play Minecraft.
#   2. When the mood is just right on windy days, I would rather sit inside a-
#   3. When the mood is just right on windy days, I would rather sit inside and play Minecraft.
#   4. When the mood is just right on windy days, I would rather sit inside and play my
#   5. When the mood is just right on windy days, I would rather sit inside and play my-

# In the morning (rising intonation)
#   1. In the morning?
#   2. in the morning.
#   3. in the morning?
#   4. in the morning
#   5. In the morning?

# I don't know about you, but
#   1. I don't know about you.
#   2. I don't know about you, but...
#   3. I don't know about you, but...
#   4. I don't know about you, but...
#   5. I don't know about you, but...

In [None]:
# Conclusion
# Again, the two metrics to address are:
#   1. accuracy of phrase/sentence termination with a period or question mark
#   2. sensitivity to terminating the input window
# The "small.en" model seemed to be less accurate at determining whether a sentence/phrase is a complete thought (point 1)
#   than both the "small" and "base.en" models.
# The "small.en" model also seemed similarly trigger-happy about closing the input window before a complete thought
#   was spoken. I adjusted my intonation and cadence to emphasize that I was not finished speaking at the
#   problematic areas, specifically "...on windy days..." and "the fitness gram pacer...", but saw no change in results.

# The "base.en" model seems to be the most suitable: it performs best in the areas I care about most, and it also
#   has lower compute requirements. "base.en" can evaluate in real time, while "small" lags behind considerably.

# machine specs:
#   32GB DDR5
#   AMD Ryzen 9 7950X, 16 cores @ 4.5 GHz
#   NVIDIA GeForce RTX 3070
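
The "real time" versus "lags behind" comparison above was judged by feel. One way to quantify it would be a real-time factor (processing time divided by audio duration). This is a sketch, not something I measured: `real_time_factor` and `timed` are hypothetical helper names, and the lambda stands in for an actual `recognize_whisper` call.

```python
import time
from typing import Callable


def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Processing time divided by audio duration; below 1.0 means faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def timed(transcribe: Callable[[], str]) -> tuple[str, float]:
    """Run a zero-argument transcription callable and return (text, elapsed seconds)."""
    start = time.perf_counter()
    text = transcribe()
    return text, time.perf_counter() - start


# Stand-in for e.g. `lambda: recognizer.recognize_whisper(audio, model="base.en")`
text, elapsed = timed(lambda: "I flew.")
print(f"RTF for a 1.0 s clip: {real_time_factor(elapsed, 1.0):.4f}")
```

For mono audio captured with `sr.AudioData`, the clip duration should be recoverable as `len(audio.get_raw_data()) / (audio.sample_rate * audio.sample_width)`.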



# Follow-up Question:
# Does Google's online model do better in these problem areas?

In [None]:
# Follow up question code:

with sr.Microphone() as source:
    print("Listening...")
    recognizer.adjust_for_ambient_noise(source)
    audio = recognizer.listen(source)
    
try:
    text = recognizer.recognize_google(audio)
    print(f"You said: {text}")
except sr.UnknownValueError:
    print("Sorry, I couldn't understand that.")
except sr.RequestError:
    print("Speech service unavailable")

Listening...
You said: I don't know about you but


In [None]:

# `recognize_google` results:

# I like big butts and I cannot lie.
#   1. I like big butts and I cannot lie
#   2. I like big butts and I cannot lie
#   3. I like big butts and I cannot lie
#   4. I like big butts and
#   5. do you like big butts and I cannot

# The fitness gram pacer test is a cardiovascular test that gradually picks up pace as the test progresses.
#   1. the fitnessgram Pacer test is a cardiovascular
#   2. the fitnessgram Pacer
#   3. the fitnessgram Pacer test is a cardiovascular
#   4. did the fitnessgram Pacer test is a cardiovascular
#   5. the fitnessgram Pacer

# When the mood is just right, on windy days, I would rather sit inside and play minecraft, compared to my other option, which is, taking the
#       dogs for a walk.
#   1. when the mood is just right on Windy days
#   2. when the mood is just right on Windy days I would rather sit inside and play Minecraft
#   3. when the mood is just right on Windy days I would rather sit
#   4. when the mood is just right on Windy days I would rather sit
#   5. when the mood is just right on Windy days I would rather sit inside and play Minecraft

# In the morning (rising intonation)
#   1. in the morning
#   2. in the morning
#   3. in the morning
#   4. in the morning
#   5. in the morning

# I don't know about you, but
#   1. I don't know about you
#   2. I don't know about you but
#   3. I don't know about you but
#   4. I don't know about you but
#   5. I don't know about you but


# Conclusion:
# No. The Google recognition service does not add terminating punctuation, which makes it a weak fit for my specific use case. Additionally,
# I just realized that the `Recognizer.listen()` method alone is responsible for audio capture timing, not the transcription model.


# Follow-up Questions:
# Would doubling the `pause_threshold` and the `non_speaking_duration` increase the accuracy of the timing of the audio capture? Specifically,
# looking at the longer two examples: "The fitness gram pacer test..." and "When the mood is just right..."
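
Why a longer `pause_threshold` should help can be seen with a toy model of silence-based endpointing. This is a simplified sketch of the general technique, not the library's actual implementation. The function name and the energy values are made up, `non_speaking_duration` is not modeled, and the 800 ms and 300 figures mirror what I understand to be the library defaults for `pause_threshold` and `energy_threshold`.

```python
def capture_end_index(energies, energy_threshold, pause_threshold_ms, chunk_ms):
    """Index of the chunk where capture would stop, or None if it never stops."""
    # Number of consecutive quiet chunks needed to end the phrase (ceiling division)
    quiet_chunks_needed = -(-pause_threshold_ms // chunk_ms)
    speaking_started = False
    quiet_run = 0
    for index, energy in enumerate(energies):
        if energy > energy_threshold:
            speaking_started = True
            quiet_run = 0
        elif speaking_started:
            quiet_run += 1
            if quiet_run >= quiet_chunks_needed:
                return index
    return None


# 100 ms chunks: speech, a 1 s mid-sentence pause, more speech, then silence
energies = [500] * 5 + [50] * 10 + [500] * 5 + [50] * 30
default_stop = capture_end_index(energies, 300, pause_threshold_ms=800, chunk_ms=100)
doubled_stop = capture_end_index(energies, 300, pause_threshold_ms=1600, chunk_ms=100)
print(default_stop, doubled_stop)
```

With the shorter threshold the capture ends inside the mid-sentence pause. With the doubled threshold it survives the pause and ends only in the trailing silence, at the cost of a longer wait after the speaker is done.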

In [34]:
# Which microphone was I using for all of these experiments?
import pyaudio

p = pyaudio.PyAudio()

# Get default input device info
default_device = p.get_default_input_device_info()
default_index = int(default_device['index'])

print(f"Default microphone: Index {default_index} - {default_device['name']}")
print("-" * 50)

# List all audio input devices
print("All available audio input devices:")
for i in range(p.get_device_count()):
    dev_info = p.get_device_info_by_index(i)
    if dev_info["maxInputChannels"] > 0:  # If it has input channels, it's a microphone
        is_default = " (DEFAULT)" if i == default_index else ""
        print(f"Index: {i}{is_default}, Name: {dev_info['name']}")
        print(f"  Input channels: {dev_info['maxInputChannels']}")
        print(f"  Default sample rate: {dev_info['defaultSampleRate']}")

p.terminate()

Default microphone: Index 13 - default
--------------------------------------------------
All available audio input devices:
Index: 6, Name: HD-Audio Generic: ALC897 Analog (hw:2,0)
  Input channels: 2
  Default sample rate: 44100.0
Index: 8, Name: HD-Audio Generic: ALC897 Alt Analog (hw:2,2)
  Input channels: 2
  Default sample rate: 44100.0
Index: 9, Name:  SMY18: USB Audio (hw:3,0)
  Input channels: 2
  Default sample rate: 44100.0
Index: 10, Name: Q9-1: USB Audio (hw:4,0)
  Input channels: 1
  Default sample rate: 44100.0
Index: 12, Name: pulse
  Input channels: 32
  Default sample rate: 44100.0
Index: 13 (DEFAULT), Name: default
  Input channels: 32
  Default sample rate: 44100.0



In [48]:
recognizer = sr.Recognizer()
with sr.Microphone() as source:
    print("Listening...")
    recognizer.adjust_for_ambient_noise(source)
    recognizer.pause_threshold *= 2
    recognizer.non_speaking_duration *= 2
    audio = recognizer.listen(source)
    
try:
    text = recognizer.recognize_whisper(audio, model="base.en") # Works offline!
    print(f"You said: {text}")
except sr.UnknownValueError:
    print("Sorry, I couldn't understand that.")
except sr.RequestError:
    print("Speech service unavailable")


Listening...
You said:  The fitness gram pacer test is a cardiovascular test that gradually picks up pace as the test progresses.


In [49]:
# Results (doubling `pause_threshold` & `non_speaking_duration`):

# The fitness gram pacer test is a cardiovascular test that gradually picks up pace as the test progresses.
#   1. The fitness-gram pacer test is a cardiovascular test that gradually picks up pace as the test progresses.
#   2. The fitness gram pacer test is a cardiovascular test that gradually picks up pace as the test progresses.
#   3. The fitness gram pacer test is a cardiovascular test that gradually picks up pace as the test progresses.
#   4. The fitness gram pacer test is a cardiovascular test that gradually picks off pace as the test progresses.
#   5. The fitness gram pacer test is a cardiovascular test that gradually picks up pace as the test progresses.

# When the mood is just right, on windy days, I would rather sit inside and play minecraft, compared to my other option, which is, taking the
#       dogs for a walk.
#   1. When the mood is just right on windy days, I would rather sit inside and play mine.
#   2. When the mood is just right on windy days, I would rather sit inside and play Minecraft compared to my other option, which is taking the dogs for a walk.
#   3. When the mood is just right, on windy days, I would rather sit inside and play Minecraft compared to my other option, which is taking the dogs for a walk.
#   4. When the mood is just right on windy days, I would rather sit inside and play my
#   5. When the mood is just right on windy days, I would rather sit inside and play my...

# Conclusions
# This was a smashing success! The recognizer still struggled a bit to hold out for the end of the "When the mood is just right..." sentence, but
# to be fair, this is a long, run-on sentence that would rarely be heard in practice. And still, the "base.en" model, which was evaluating the
# sentence in near real time, captured it in full in 2 of 5 trials.

# Follow-up Question:
# Could I ditch reliance on the `sr.Recognizer` speech timing altogether by streaming audio to a local speech recognition model?
# Answer: Yes, but OpenAI's Whisper model does not offer streaming support. Other offline open-source options exist though, like Vosk,
#   Mozilla DeepSpeech, Kaldi, wav2letter++, and NVIDIA NeMo.

# Could I reliably stitch two or more "listening sessions" to generate accurate transcriptions?


In [62]:
# Attempt to stitch two audios and play back

r = sr.Recognizer()

with sr.Microphone() as source:
    print("Say something...")
    audio1 = r.listen(source)
    print("keep reading...")
    audio2 = r.listen(source)
raw_audio_combined = audio1.get_raw_data() + audio2.get_raw_data()
audio_combined = sr.AudioData(raw_audio_combined, audio1.sample_rate, audio1.sample_width)

p = pyaudio.PyAudio()
# Open the output stream using the format of the audio captured in this cell
stream = p.open(format=p.get_format_from_width(audio1.sample_width),
                channels=1,
                rate=audio1.sample_rate,
                output=True)

# Play the audio data
stream.write(raw_audio_combined)

# Clean up
stream.stop_stream()
stream.close()
p.terminate()

# Speech Recognition
print(f"1: {r.recognize_whisper(audio1, model='base.en')}")
print(f"2: {r.recognize_whisper(audio2, model='base.en')}")
print(f"combined: {r.recognize_whisper(audio_combined, model='base.en')}")
# NOTE: `AudioData` does not support `+`, so this last line raises TypeError (see output)
print(f"combined without transforming to raw: {r.recognize_whisper(audio1 + audio2, model='base.en')}")


Say something...
keep reading...
1: 
2:  Test, tests, test day one.
combined:  Test, tests, test day one.


TypeError: unsupported operand type(s) for +: 'AudioData' and 'AudioData'

In [None]:
# Answer: Yes! It does require converting the `AudioData` instances to raw bytes before concatenating, but this seems like
# a small cost for such large potential.
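
Raw concatenation only makes sense when both clips share the same sample rate and sample width. A guarded version might look like the sketch below. `PcmClip` and `concat_clips` are hypothetical names of my own, used so the example runs without a microphone.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PcmClip:
    """Raw PCM bytes plus the format details needed to interpret them."""
    raw: bytes
    sample_rate: int
    sample_width: int


def concat_clips(clips: list[PcmClip]) -> PcmClip:
    """Join clips end to end, refusing to mix incompatible formats."""
    if not clips:
        raise ValueError("need at least one clip")
    first = clips[0]
    for clip in clips[1:]:
        if (clip.sample_rate, clip.sample_width) != (first.sample_rate, first.sample_width):
            raise ValueError("clips must share sample rate and sample width")
    return PcmClip(b"".join(c.raw for c in clips), first.sample_rate, first.sample_width)


combined = concat_clips([PcmClip(b"\x01\x00", 16000, 2), PcmClip(b"\x02\x00", 16000, 2)])
print(len(combined.raw), combined.sample_rate, combined.sample_width)
```

With `speech_recognition`, each clip would be filled from `audio.get_raw_data()`, `audio.sample_rate`, and `audio.sample_width`, and the result passed back to `sr.AudioData(...)` as in the cell above.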

# Considering this & the fact that `listen_in_background` takes a callback that is invoked each time a phrase is captured, the speech recognition
# workflow can be completely revamped. The goal is to end capture more accurately, once the prompt has been spoken in full.

# Currently, the input window is consistently being shut too soon, which makes this software practically useless. It is definitely a safe bet to err
# on the side of keeping the input window open longer than ideal, while staying mindful that going too far in this direction can stall
# productive, faster-paced conversations.
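
One possible shape for the revamped workflow is to keep stitching segments until the running transcript looks finished. This is only a sketch: `capture_prompt`, `looks_complete`, and the injected `listen` / `transcribe` callables are hypothetical, and the stop rule leans on terminal punctuation, which the trials above showed is not fully reliable (e.g. "in the morning.").

```python
from typing import Callable, Optional

TERMINAL_MARKS = (".", "?", "!")


def looks_complete(transcript: str) -> bool:
    """True when the text ends in '.', '?' or '!' but not in a trailing ellipsis."""
    text = transcript.strip()
    return text.endswith(TERMINAL_MARKS) and not text.endswith("...")


def capture_prompt(
    listen: Callable[[], Optional[bytes]],
    transcribe: Callable[[bytes], str],
    max_segments: int = 4,
) -> str:
    """Stitch raw segments together until the transcript looks complete."""
    raw = b""
    transcript = ""
    for _ in range(max_segments):
        segment = listen()
        if segment is None:  # nothing more was said
            break
        raw += segment
        transcript = transcribe(raw).strip()
        if looks_complete(transcript):
            break
    return transcript


# Fake capture and transcription so the sketch runs without a microphone
segments = iter([b"\x00\x00", b"\x01\x01"])
fake_transcripts = {2: "I like big butts and I...", 4: "I like big butts and I cannot lie."}
result = capture_prompt(lambda: next(segments, None), lambda raw: fake_transcripts[len(raw)])
print(result)
```

The `max_segments` cap is there so a speaker who never produces a "complete" transcript cannot hold the input window open indefinitely.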

# Also, a nice-to-have might be: after a certain prompt length, say 200 words, the agent should ask for confirmation before invoking
# with this prompt. Maybe the agent will say "I heard up to '...some last words of the prompt', is that right?" The response to this should be handled
# by a "dumb" model, maybe a small, local model or a cheap online model like DeepSeek. Importantly, this model will have access to edit the working prompt.
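
The confirmation step could start as simply as the sketch below. The function name, the `tail_words` parameter, and its default of 6 are my own placeholders. Only the 200-word limit and the wording of the question come from the note above.

```python
from typing import Optional


def confirmation_question(prompt: str, word_limit: int = 200, tail_words: int = 6) -> Optional[str]:
    """Return a read-back question for long prompts, or None when no check is needed."""
    words = prompt.split()
    if len(words) < word_limit:
        return None
    tail = " ".join(words[-tail_words:])
    return f"I heard up to '...{tail}', is that right?"


short_prompt = "What is the weather outside like today?"
long_prompt = " ".join(f"w{i}" for i in range(201))
print(confirmation_question(short_prompt))
print(confirmation_question(long_prompt))
```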

# Another nice-to-have would be diarization, with the model taking a first-come, first-served policy across speakers.