# Assignment: **Design an End-to-End AI Voice Assistance Pipeline 🎙️🤖**


## **Objective:**
* Design a pipeline that takes a voice query command, converts it into text, uses a Large Language Model (LLM) to generate a response, and then converts the output text back into speech.
* The system should have low latency, use Voice Activity Detection (VAD), restrict the output to two sentences, and expose tunable parameters such as pitch, male/female voice, and speed.


* The pipeline chains the following three components, each covered in its own section:
1. Voice Input Processing (Transcription) 🎤
2. Language Model Response (LLM) 📝🧠
3. Text-to-Speech Conversion (TTS) 🗣️🔊


## 1. Voice Input Processing (Transcription) 🎤🔍:
* Utilize an Automatic Speech Recognition (ASR) model (such as Whisper) to transcribe voice queries into text.
* Handle audio preprocessing, resampling, and stereo-to-mono conversion.

In [None]:
!pip install git+https://github.com/openai/whisper.git

Collecting git+https://github.com/openai/whisper.git
  Cloning https://github.com/openai/whisper.git to /tmp/pip-req-build-b8fi8pf2
  Running command git clone --filter=blob:none --quiet https://github.com/openai/whisper.git /tmp/pip-req-build-b8fi8pf2
  Resolved https://github.com/openai/whisper.git to commit ba3f3cd54b0e5b8ce1ab3de13e32122d0d5f98ab
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting tiktoken (from openai-whisper==20231117)
  Downloading tiktoken-0.7.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.6 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.1.105 (from torch->openai-whisper==20231117)
  Using cached nvidia_cuda_nvrtc_cu12-12.1.105-py3-none-manylinux1_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.1.105 (from torch->openai-whisper==20231117)
  Using cached nvidia_cuda_runtime_cu12-12.1.105-py3-n

In [None]:
!pip install webrtcvad

Collecting webrtcvad
  Downloading webrtcvad-2.0.10.tar.gz (66 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: webrtcvad
  Building wheel for webrtcvad (setup.py) ... done
  Created wheel for webrtcvad: filename=webrtcvad-2.0.10-cp310-cp310-linux_x86_64.whl size=73459 sha256=11d12cb072471734c6059ed0dcc66c2ef1442256c9de3310a9a32923992371be
  Stored in directory: /root/.cache/pip/wheels/2a/2b/84/ac7bacfe8c68a87c1ee3dd3c66818a54c71599abf308e8eb35
Successfully built webrtcvad
Installing collected packages: webrtcvad
Successfully installed webrtcvad-2.0.10


In [None]:
import os
import tempfile

import torchaudio
import whisper

# Load the Whisper model (English-only "base" checkpoint)
asr_model = whisper.load_model("base.en")

# Define the audio processing settings
audio_path = "/content/lofi_chase_on_the_floor(256k).mp3"  # Replace with your audio file path
sampling_rate = 16000  # Target sampling rate; Whisper works on 16 kHz audio

# Load the audio and resample it to the target rate
waveform, original_sr = torchaudio.load(audio_path)
waveform = torchaudio.transforms.Resample(orig_freq=original_sr, new_freq=sampling_rate)(waveform)

# Convert stereo to mono if needed by averaging the channels
if waveform.shape[0] > 1:
    waveform = waveform.mean(dim=0, keepdim=True)

# Reserve a temporary file name, then write the audio once the handle is closed
with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as temp_file:
    temp_path = temp_file.name
torchaudio.save(temp_path, waveform, sampling_rate)

# Transcribe the audio using Whisper; remove the temporary file even if this fails
try:
    result = asr_model.transcribe(temp_path, language="en")
finally:
    os.remove(temp_path)

# Output the transcribed text
print("\nTranscription:")
print(result["text"])


Transcription:
 turnout I'm no snooze And everybody knows I get off the train And the ladies will choose the truth I'm like inception I play with you when you're so sweet I'm no snooze I'm playing no games But don't, don't, don't, don't, don't They confuse no Cause you will lose, yeah Man, now pump, pump, pump, pump, pump And bump it up and back it up like a tonk a truck If you go hard you gotta get on the floor If you're a party, we can step on the floor If you're an animal, then tear up the floor Break the sweat on the floor, yeah, we work on the floor Don't stop, keep it falling, put your drinks up If you're body up and drop it on the floor Let the rhythm change the world on the floor You know we're running to night on the floor You know we're running to night on the floor Of course you're no rocker And men and children feel so Straight to a lady of pain Just a happy day Day is the night away If your life can stay on on the floor Day is the night away Grab somebody drink the money 

## 2. Language Model Response (LLM) 📝🧠:
* Employ a pre-trained LLM (e.g., GPT-2) to generate a contextually relevant response based on the transcribed text.
* Limit the response to a concise and informative output (e.g., two sentences).

In [None]:
!pip install torch transformers



In [11]:
from transformers import AutoModelForCausalLM, GPT2TokenizerFast
import torch

# Load the pre-trained LLM
model_name = "gpt2"
model = AutoModelForCausalLM.from_pretrained(model_name)

# GPT-2 has no pad token, so reuse the end-of-sequence token for padding
tokenizer = GPT2TokenizerFast.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

# Input text from Step 1: the transcription produced by Whisper
input_text = result['text']

# Tokenize the prompt, truncating so the prompt plus 50 new tokens fit GPT-2's 1024-token context
input_ids = tokenizer.encode(input_text, return_tensors="pt", truncation=True, max_length=1024 - 50)
attention_mask = torch.ones_like(input_ids)  # Single unpadded sequence: attend to every token

with torch.no_grad():
    # Generate at most 50 new tokens after the prompt
    output = model.generate(input_ids, attention_mask=attention_mask, max_new_tokens=50, num_return_sequences=1)
    # output[0] holds the prompt followed by the new tokens, so the decoded text starts with the transcription
    response = tokenizer.decode(output[0], skip_special_tokens=True)

print("LLM Response:", response)


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


LLM Response:  turnout I'm no snooze And everybody knows I get off the train And the ladies will choose the truth I'm like inception I play with you when you're so sweet I'm no snooze I'm playing no games But don't, don't, don't, don't, don't They confuse no Cause you will lose, yeah Man, now pump, pump, pump, pump, pump And bump it up and back it up like a tonk a truck If you go hard you gotta get on the floor If you're a party, we can step on the floor If you're an animal, then tear up the floor Break the sweat on the floor, yeah, we work on the floor Don't stop, keep it falling, put your drinks up If you're body up and drop it on the floor Let the rhythm change the world on the floor You know we're running to night on the floor You know we're running to night on the floor Of course you're no rocker And men and children feel so Straight to a lady of pain Just a happy day Day is the night away If your life can stay on on the floor Day is the night away Grab somebody drink the money Se
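
The response printed above starts with the transcription because `generate` returns the prompt tokens followed by the new ones, and decoding `output[0]` keeps both. The following is a minimal sketch of one way to return only the continuation, assuming the `model` and `tokenizer` objects loaded in the cell above. The names `new_token_ids`, `max_prompt_tokens`, and `generate_reply` are hypothetical helpers, not part of `transformers`.

```python
def new_token_ids(output_ids, prompt_len):
    """Return only the tokens generated after the prompt."""
    return output_ids[prompt_len:]


def max_prompt_tokens(context_window=1024, max_new_tokens=50):
    """Leave room for the continuation inside the model's context window."""
    return context_window - max_new_tokens


def generate_reply(model, tokenizer, text, max_new_tokens=50):
    """Generate a continuation and decode it without the echoed prompt."""
    import torch  # imported lazily so the helpers above stay dependency-free

    inputs = tokenizer(
        text,
        return_tensors="pt",
        truncation=True,
        max_length=max_prompt_tokens(max_new_tokens=max_new_tokens),
    )
    with torch.no_grad():
        output = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            pad_token_id=tokenizer.eos_token_id,  # avoids the pad_token_id notice
        )
    prompt_len = inputs["input_ids"].shape[1]
    reply_ids = new_token_ids(output[0], prompt_len)
    return tokenizer.decode(reply_ids, skip_special_tokens=True)
```

Calling `generate_reply(model, tokenizer, result['text'])` would then yield just the generated text. GPT-2 is a plain text-completion model, so the continuation is not guaranteed to read like an answer to the query.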

## 3. Text-to-Speech Conversion (TTS) 🗣️🔊:
* Convert the LLM-generated response back into speech.
* Use a TTS system (e.g., edge-tts) to create an audio output.

In [13]:
!pip install edge-tts

Collecting edge-tts
  Downloading edge_tts-6.1.12-py3-none-any.whl.metadata (4.0 kB)
Downloading edge_tts-6.1.12-py3-none-any.whl (29 kB)
Installing collected packages: edge-tts
Successfully installed edge-tts-6.1.12


In [16]:
import asyncio
import edge_tts

# Use edge-tts to convert the LLM response to speech
async def text_to_speech(text: str, output_file: str, voice: str = "en-US-GuyNeural"):
    communicate = edge_tts.Communicate(text, voice=voice)
    await communicate.save(output_file)

# edge-tts writes MP3 audio, so save the response with an .mp3 extension
output_file = "output.mp3"
# A notebook already runs an event loop, so use top-level await here;
# in a plain script, call asyncio.run(text_to_speech(response, output_file)) instead
await text_to_speech(response, output_file)

print(f"Speech saved as {output_file}")

Speech saved as output.mp3
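
The objective also asks for tunable pitch, voice, and speed. As a hedged sketch, the helpers below build the keyword arguments for `edge_tts.Communicate`, assuming an edge-tts release that accepts the `rate` and `pitch` keywords as signed strings such as `"+20%"` and `"-5Hz"`. The names `VOICES`, `signed`, `tts_options`, and `speak` are hypothetical, and the female voice `en-US-JennyNeural` is an assumed choice that does not appear in the original notebook.

```python
# Assumed voice names; list the available ones with `edge-tts --list-voices`
VOICES = {"male": "en-US-GuyNeural", "female": "en-US-JennyNeural"}


def signed(value, unit):
    """Format an integer offset with an explicit sign, e.g. '+20%' or '-5Hz'."""
    return f"{value:+d}{unit}"


def tts_options(gender="male", speed_percent=0, pitch_hz=0):
    """Build keyword arguments for edge_tts.Communicate."""
    return {
        "voice": VOICES[gender],
        "rate": signed(speed_percent, "%"),
        "pitch": signed(pitch_hz, "Hz"),
    }


async def speak(text, output_file, **options):
    """Synthesize `text` to `output_file` with the chosen voice settings."""
    import edge_tts  # imported lazily so the helpers above need no extra packages

    communicate = edge_tts.Communicate(text, **tts_options(**options))
    await communicate.save(output_file)
```

In a notebook, `await speak(response, "output.mp3", gender="female", speed_percent=20, pitch_hz=-5)` would produce a faster, slightly lower female voice.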


## 4. Additional Considerations:
* Implement Voice Activity Detection (VAD) to identify when the user is speaking.
* Allow for tunable parameters, such as pitch, male/female voice, and speed.
* Ensure low latency throughout the pipeline.
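
Before tuning for latency it helps to know where the time goes. The sketch below records wall-clock time per stage with the standard library only. The `timed` context manager and the `timings` dictionary are hypothetical names introduced for illustration.

```python
import time
from contextlib import contextmanager

timings = {}  # stage name -> elapsed seconds


@contextmanager
def timed(stage):
    """Record how long the wrapped block takes under the given stage name."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage] = time.perf_counter() - start
```

Wrapping each step, for example `with timed("asr"):` around the transcription call and `with timed("llm"):` around `model.generate`, fills `timings` with one entry per stage so the slowest stage can be addressed first.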

### 4.1. Voice Activity Detection (VAD) Implementation 🔍🎙️
* Voice Activity Detection (VAD) is crucial for identifying when the user starts and stops speaking, ensuring that the pipeline processes only relevant audio input. You can use the webrtcvad library to implement VAD.

In [34]:
import webrtcvad
import torchaudio
import numpy as np

def detect_voice_activity(waveform, sample_rate, frame_duration=30):
    """Return True if any frame of the mono waveform contains speech.

    webrtcvad accepts 8, 16, 32, or 48 kHz audio in frames of 10, 20, or 30 ms.
    """
    vad = webrtcvad.Vad(1)  # Aggressiveness ranges from 0 to 3; 1 is a mild setting
    samples_per_frame = int(sample_rate * frame_duration / 1000)
    num_frames = waveform.size(1) // samples_per_frame

    for i in range(num_frames):
        start = i * samples_per_frame
        frame = waveform[0, start:start + samples_per_frame].numpy()

        # Convert float samples in [-1, 1] to 16-bit PCM, clipping to avoid integer overflow
        pcm = (np.clip(frame, -1.0, 1.0) * 32767).astype(np.int16)
        if vad.is_speech(pcm.tobytes(), sample_rate):
            return True  # Stop at the first frame that contains speech

    return False


# Process and check audio file
waveform, original_sr = torchaudio.load(audio_path)
waveform = torchaudio.transforms.Resample(orig_freq=original_sr, new_freq=sampling_rate)(waveform)

# Convert stereo to mono if needed
if waveform.shape[0] > 1:
    waveform = waveform.mean(dim=0).unsqueeze(0)

# Check for voice activity
if detect_voice_activity(waveform, sampling_rate):
    print("Voice activity detected. Proceeding with transcription...")
else:
    print("No voice activity detected.")


Voice activity detected. Proceeding with transcription...
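
The check above answers only whether speech occurs at all. To find when the user starts and stops speaking, the per-frame results of `vad.is_speech` can be collected into a list and merged into time ranges. The function below is a minimal sketch of that merging step. `speech_segments` is a hypothetical helper, and it assumes one boolean flag per frame of `frame_ms` milliseconds.

```python
def speech_segments(flags, frame_ms=30):
    """Merge per-frame speech flags into (start_ms, end_ms) segments."""
    segments = []
    start = None  # start time of the segment currently being built
    for i, is_speech in enumerate(flags):
        if is_speech and start is None:
            start = i * frame_ms  # speech begins at this frame
        elif not is_speech and start is not None:
            segments.append((start, i * frame_ms))  # speech ended before this frame
            start = None
    if start is not None:
        # The audio ended while speech was still active
        segments.append((start, len(flags) * frame_ms))
    return segments
```

Only the samples inside the returned ranges would then need to be passed to the transcription step, which also reduces the amount of audio Whisper has to process.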


### 4.2. Generate and Restrict LLM Response

In [49]:
import re

def truncate_response(text, max_sentences=2):
    # Split after sentence-ending punctuation; text without '.', '!' or '?' counts as one sentence
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    return ' '.join(sentences[:max_sentences])

# Generate LLM response, truncating the prompt so 50 new tokens still fit GPT-2's 1024-token context
input_ids = tokenizer.encode(input_text, return_tensors="pt", truncation=True, max_length=1024 - 50)
attention_mask = torch.ones_like(input_ids)  # Single unpadded sequence: attend to every token

with torch.no_grad():
    output = model.generate(input_ids, attention_mask=attention_mask, max_new_tokens=50, num_return_sequences=1)
    response = tokenizer.decode(output[0], skip_special_tokens=True)

# Truncate response to a maximum of 2 sentences
truncated_response = truncate_response(response)
print("LLM Response (Truncated):", truncated_response)


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


LLM Response (Truncated): The Last I'm close, I'm close, I'm close, I'm close, I'm close, I'm close, I'm close, I'm close, I'm close, I'm close, I'm close, I'm close, I'm close, I'm close, I'm close, I'm close, I'm close, I'm close, I'm close, I'm close, I'm close, I'm close, I'm close, I'm close, I'm close, I'm close, I'm close, I'm close, I'm close, I'm close, I'm close, I'm close, I'm close, I'm close, I'm close, I'm close, I'm close, I'm close, I'm close, I'm close, I'm close, I'm close, I'm close, I'm close, I'm close, I'm close, I'm close, I'm close, I'm close, I'm close, I'm close, I'm close, I'm close, I'm close, I'm close, I'm close I'm close, I'm close, I'm close, I'm close, I'm close, I'm close, I'm close, I'm close, I'm close, I've been close, lost Troops, I found one Rolls-p Your welcome, and love them to the fizzer Stray to the world in your van This is the half-way dada Day is the light away If your life can stay on the floor Day is the light away Grab somebody drink the
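
The output above contains no sentence-ending punctuation, so the regular expression finds a single "sentence" and nothing is cut. One possible safeguard, sketched here under the assumption that a word cap is an acceptable fallback, is to limit the word count after limiting sentences. `limit_response` and its `max_words` default are hypothetical.

```python
import re


def limit_response(text, max_sentences=2, max_words=40):
    """Keep at most `max_sentences` sentences, then at most `max_words` words."""
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    limited = ' '.join(sentences[:max_sentences])
    words = limited.split()
    if len(words) > max_words:
        # Fallback for text with little or no punctuation
        limited = ' '.join(words[:max_words])
    return limited
```

With punctuated text the function behaves like `truncate_response`; with unpunctuated text such as the lyrics above, it returns only the first `max_words` words.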