How Does a Voice AI Assistant Work in Real Time?

Before we write the code, let's understand what we are building. An AI voice assistant is essentially a loop of three stages, each mimicking a biological function in code:

1. The Ears (Speech-to-Text): We capture audio vibrations and translate them into text.

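2. The Brain (LLM Inference): We send that text to a language model to generate a response. This is often a Large Language Model served through Ollama (for example Llama 3); the code in this notebook uses the much smaller local model google/flan-t5-small instead.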

3. The Mouth (Text-to-Speech): We convert the AI's text response back into audio so we can hear it.
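
The three stages above can be sketched as a single turn of the loop. This is a minimal illustration, not the notebook's actual implementation: `run_turn` and its parameters are hypothetical names, and the lambdas are stand-ins so the sketch runs without a microphone, a model, or speakers.

```python
from typing import Callable, Optional


def run_turn(
    listen_fn: Callable[[], Optional[str]],
    think_fn: Callable[[str], str],
    speak_fn: Callable[[str], None],
    stop_word: str = "stop",
) -> bool:
    """Run one listen -> think -> speak cycle.

    Returns False when the stop word was heard, True otherwise.
    """
    heard = listen_fn()            # Ears: may return None if nothing was understood
    if heard is None:
        return True                # nothing to answer; keep the loop going
    if stop_word in heard.lower():
        return False               # the caller should end the loop
    speak_fn(think_fn(heard))      # Brain, then Mouth
    return True


# Stand-ins (assumptions for illustration only).
spoken = []
keep_going = run_turn(
    lambda: "what is python",
    lambda text: f"You asked: {text}",
    spoken.append,
)
```

Passing the stages in as functions keeps the control flow separate from the hardware, so each stage can be swapped or tested on its own.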

1️⃣ IMPORT SECTION

In [3]:
# ===== IMPORTS =====

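import speech_recognition as sr  # microphone capture and speech-to-text
import pyttsx3  # offline TTS engine; imported here but unused (the Mouth cell uses gTTS)
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM  # local seq2seq model
import torch  # PyTorch backend used by transformers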

2️⃣ EARS (Speech to Text)

In [4]:
# ===== EARS (Listening System) =====

recognizer = sr.Recognizer()

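def listen():
    """Record one utterance from the microphone and return it as text, or None."""
    with sr.Microphone() as source:
        print("🎤 Listening...")
        # Calibrate the energy threshold against background noise
        recognizer.adjust_for_ambient_noise(source, duration=0.5)
        audio = recognizer.listen(source)

        try:
            # Sends the audio to Google's web speech API (needs internet access)
            text = recognizer.recognize_google(audio)
            print("🗣 You said:", text)
            return text
        except sr.UnknownValueError:
            # The speech was unintelligible
            print("⚠️ Could not understand.")
            return None
        except sr.RequestError as e:
            # The API was unreachable or the request failed
            print("⚠️ Speech service error:", e)
            return None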

3️⃣ BRAIN (Small Local LLM)

In [5]:
# ===== BRAIN (AI Thinking) =====

model_name = "google/flan-t5-small"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
model.to("cpu")

def think(user_input):
    prompt = f"Answer clearly and concisely:\n{user_input}"

    inputs = tokenizer(prompt, return_tensors="pt")

    outputs = model.generate(
        **inputs,
        max_new_tokens=100,
        do_sample=False
    )

    response = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return response

4️⃣ MOUTH (Text to Speech)

In [27]:
# ===== MOUTH (Windows Safe Version) =====

from gtts import gTTS
import os
import time

def speak(text):
    print("🤖 AI:", text)
    
    filename = "response.mp3"
    
    # Convert text to speech
    tts = gTTS(text=text, lang="en")
    tts.save(filename)
    
    # Play using Windows default player
    os.system(f"start {filename}")
    
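    # Optional: wait, then clean up. The fixed 5-second wait may be shorter
    # than the audio, and the player may still hold the file open.
    time.sleep(5)
    try:
        os.remove(filename)
    except OSError:
        pass  # file is missing or still in use; the next run overwrites it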
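
The `start` command used above exists only on Windows. As a hedged sketch of how playback might be made portable, the helper below picks a command per operating system. `playback_command` is a hypothetical name, not part of the notebook; it assumes `afplay` is available on macOS and `xdg-open` on Linux desktops.

```python
import platform
from typing import List, Optional


def playback_command(filename: str, system: Optional[str] = None) -> List[str]:
    """Return an argument list that opens or plays an audio file on this OS."""
    system = system or platform.system()
    if system == "Windows":
        # "start" is a cmd.exe builtin; the empty string is the window title
        return ["cmd", "/c", "start", "", filename]
    if system == "Darwin":
        # afplay ships with macOS and returns when playback ends
        return ["afplay", filename]
    # Assumption: a Linux desktop with xdg-open installed
    return ["xdg-open", filename]


command = playback_command("response.mp3", system="Darwin")
```

The resulting list can be passed to `subprocess.run(command)` in place of `os.system`, which also avoids building a shell string from the filename.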

5️⃣ MAIN (Control Center)

In [28]:
# ===== MAIN LOOP =====

print("🤖 Voice AI Assistant Started")
print("Say 'stop' to exit.\n")

while True:
    user_input = listen()

    if user_input is None:
        continue

    if "stop" in user_input.lower():
        speak("Goodbye Ravi. Have a great day!")
        break

    response = think(user_input)
    speak(response)

🤖 Voice AI Assistant Started
Say 'stop' to exit.

🎤 Listening...
🗣 You said: hello hello hello hello hello hello hello hello hello hello hello
ü§ñ AI: hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello
🎤 Listening...
⚠️ Could not understand.
🎤 Listening...
⚠️ Could not understand.
🎤 Listening...
🗣 You said: stop stop
🤖 AI: Goodbye Ravi. Have a great day!
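
The run above shows the recognizer returning the same word many times in a row, which the small model then echoes back. One possible mitigation, not part of the original notebook, is to collapse consecutive duplicate words before the text reaches `think()`. `collapse_repeats` is a hypothetical helper, and it would also shorten intentional repetition such as "very very good".

```python
def collapse_repeats(text: str) -> str:
    """Collapse runs of the same word (case-insensitive) into one occurrence."""
    kept = []
    for word in text.split():
        # Keep the word only if it differs from the previously kept word
        if not kept or word.lower() != kept[-1].lower():
            kept.append(word)
    return " ".join(kept)


cleaned = collapse_repeats("hello hello hello hello")
```

In the main loop this could be applied as `response = think(collapse_repeats(user_input))`.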
