### Week 7

This notebook guides you through building an offline voice assistant pipeline that:
1. Records audio from your microphone
2. Transcribes speech to text using Whisper
3. Processes the text using Ollama
4. Generates an audio response using Kokoro

First, we will process audio using the sounddevice library.

Run the following in your terminal: ` pip install sounddevice numpy scipy `

In [2]:
import sounddevice as sd
import numpy as np
from scipy.io.wavfile import write


def record_until_silence(threshold=0.01, silence_duration=1.0, sample_rate=44100):
    """Record from the default microphone until `silence_duration` seconds of quiet.

    Counting starts immediately, so begin speaking within `silence_duration` seconds.
    """
    print("Recording... Speak into the microphone.")
    recording = []
    silence_counter = 0  # consecutive quiet frames seen so far

    def callback(indata, frames, time, status):
        nonlocal silence_counter
        if status:
            print(status)  # report overflows or other stream problems
        # RMS level of this block (mono input)
        volume_norm = np.linalg.norm(indata) / np.sqrt(frames)
        if volume_norm < threshold:
            silence_counter += frames
        else:
            silence_counter = 0
        recording.append(indata.copy())

    with sd.InputStream(callback=callback, channels=1, samplerate=sample_rate):
        # Poll until the quiet stretch is long enough
        while silence_counter < silence_duration * sample_rate:
            sd.sleep(100)

    audio_data = np.concatenate(recording)

    # Save file as "output.wav"
    write("output.wav", sample_rate, audio_data)
    print("Recording saved as 'output.wav'")
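
The silence check in the callback can be tried without a microphone. The sketch below reimplements it with the standard library only and runs it on synthetic blocks. The helper names `block_rms` and `blocks_until_silence`, and the toy 100 Hz sample rate, are illustrative assumptions rather than part of the notebook. For mono input, `np.linalg.norm(indata) / np.sqrt(frames)` is the same root-mean-square level computed here.

```python
import math


def block_rms(samples):
    """Root-mean-square level of one block of float samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def blocks_until_silence(blocks, threshold=0.01, silence_duration=1.0, sample_rate=44100):
    """Return how many blocks are consumed before enough consecutive quiet frames are seen."""
    needed = silence_duration * sample_rate
    silent_frames = 0
    for count, block in enumerate(blocks, start=1):
        if block_rms(block) < threshold:
            silent_frames += len(block)
        else:
            silent_frames = 0  # any loud block resets the quiet stretch
        if silent_frames >= needed:
            return count
    return len(blocks)


# Synthetic input (hypothetical values): two loud blocks, then quiet ones, at 100 Hz
loud = [0.5] * 50
quiet = [0.001] * 50
stop_at = blocks_until_silence([loud, loud, quiet, quiet, quiet], sample_rate=100)
print(stop_at)  # prints 4: the second quiet block completes one second of silence
```

Raising `threshold` makes the check less sensitive to background noise, while raising `silence_duration` tolerates longer pauses between words.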

Next, we will install Whisper (https://github.com/openai/whisper/tree/main). This requires three steps:
1. Install PyTorch using the command generated here: https://pytorch.org/
2. In your terminal, run: ` pip install -U openai-whisper ` 
3. Install ffmpeg (requires Chocolatey on Windows or Homebrew on macOS):
- Windows: ` choco install ffmpeg `
- Mac: ` brew install ffmpeg `

Once complete, visit the Whisper GitHub repository for an initial implementation.
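
Before running the transcription cell, it may help to confirm that ffmpeg is visible to Python, since Whisper relies on it to decode audio files. The following is a minimal sketch using only the standard library; the helper name `missing_tools` is hypothetical.

```python
import shutil


def missing_tools(names):
    """Return the command names that cannot be found on PATH."""
    return [name for name in names if shutil.which(name) is None]


missing = missing_tools(["ffmpeg"])
if missing:
    print("Not found on PATH:", ", ".join(missing))
else:
    print("ffmpeg is available")
```

If ffmpeg is reported as missing right after installation, restarting the terminal or the notebook kernel usually refreshes `PATH`.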

In [None]:
import whisper


def speech_to_text(path):  # path to audio (.mp3 or .wav)
    # Load a local checkpoint; a model name such as "tiny.en" also works and downloads on first use
    model = whisper.load_model("./tiny.en.pt")
    # fp16=False avoids the half-precision warning when running on CPU
    result = model.transcribe(path, fp16=False)
    return result["text"].strip()

We can now pass our input through an LLM. Revisit Week 4 to implement an LLM response using Ollama.

In [None]:
import ollama
from ollama import Client
def generate_text(input):
    client = Client()
    response = client.create(
        model="jarvis:latest",
        from_="llama3.2:latest",
        system="Your name is Jarvis. You are a virtual assistant.",
        stream=False,
    )

    print(response.status)

    res = ollama.chat(
        model = "jarvis:latest",
        messages = [
            {"role": "user", "content": input}
        ]
    )

    return res["message"]["content"]
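
`generate_text` recreates the `jarvis` model on every call. One alternative, sketched here under the assumption that a per-request system prompt is acceptable, is to send the instructions as a `system` message alongside the user's text. The helper name `build_messages` is hypothetical; the resulting list has the shape that the `messages` argument of `ollama.chat` accepts.

```python
SYSTEM_PROMPT = "Your name is Jarvis. You are a virtual assistant."


def build_messages(user_text, system_prompt=SYSTEM_PROMPT):
    """Build a chat history with a system prompt followed by the user's text."""
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_text},
    ]


messages = build_messages("What is the capital of France?")
print(messages[0]["role"], messages[1]["role"])  # prints: system user

# With Ollama running locally, the list can be passed straight to the chat call:
# res = ollama.chat(model="llama3.2:latest", messages=messages)
# print(res["message"]["content"])
```

Keeping the prompt in a message rather than in a custom model makes it easy to change the assistant's persona without touching the model registry.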

Finally, implement our TTS using Kokoro (https://huggingface.co/hexgrad/Kokoro-82M):

In [None]:
# pip install -q "kokoro>=0.9.2" soundfile
from kokoro import KPipeline
from IPython.display import display, Audio
import soundfile as sf
import torch
def text_to_speech(input):
    pipeline = KPipeline(lang_code='a')  # 'a' selects American English
    text = input
    generator = pipeline(text, voice='af_heart')
    for i, (gs, ps, audio) in enumerate(generator):
        # current_tokens = len(ps.split())  # Phonemes correlate to token usage
        # print(f"Iteration {i}: {current_tokens} tokens")
        print(i, gs, ps)
        display(Audio(data=audio, rate=24000, autoplay=i==0))
        sf.write(f'{i}.wav', audio, 24000)

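        print("Playing generated audio...")
        data, fs = sf.read(f'{i}.wav', dtype='float32')  # Load audio data
        sd.play(data, fs)  # Play the audio file (sd is imported in the recording cell)
        # sd.wait()  # Wait until playback finishes
        break  # Only the first chunk is played; remove this and enable sd.wait() to speak every chunk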

All the components of the pipeline are now complete! Try putting everything together below.

In [None]:
# TODO: put everything together
record_until_silence()
user_text = speech_to_text("./output.wav")  # avoid shadowing the built-in input()
reply = generate_text(user_text)
text_to_speech(reply)
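
One way to make the wiring easier to check is to pass each stage in as a function, so the flow can be exercised with stand-ins before any audio hardware or models are involved. This is a sketch only: `run_pipeline` and its parameter names are hypothetical, and skipping empty transcripts is an added assumption rather than behavior of the cell above.

```python
def run_pipeline(record, transcribe, respond, speak, audio_path="./output.wav"):
    """Run one record -> transcribe -> respond -> speak turn.

    Returns the reply text, or None when nothing was transcribed.
    """
    record()
    user_text = transcribe(audio_path).strip()
    if not user_text:
        # Skip the LLM and TTS stages when the recording held no speech
        return None
    reply = respond(user_text)
    speak(reply)
    return reply


# Stand-in stages so the wiring can be checked without audio hardware
spoken = []
reply = run_pipeline(
    record=lambda: None,
    transcribe=lambda path: " hello ",
    respond=lambda text: f"You said: {text}",
    speak=spoken.append,
)
print(reply)  # prints: You said: hello
```

In the notebook, the real stages would be passed as `run_pipeline(record_until_silence, speech_to_text, generate_text, text_to_speech)`.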