# SpeechToSQL 
- Author: [Dooil Kwak](https://github.com/back2zion)
- Design: 
- Peer Review : [Dooil Kwak](https://github.com/back2zion)
- This is a part of [LangChain Open Tutorial](https://github.com/LangChain-OpenTutorial/LangChain-OpenTutorial)

[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/LangChain-OpenTutorial/LangChain-OpenTutorial/blob/main/19-Cookbook/01-SQL/02-SpeechToSQL.ipynb) [![Open in GitHub](https://img.shields.io/badge/Open%20in%20GitHub-181717?style=flat-square&logo=github&logoColor=white)](https://github.com/LangChain-OpenTutorial/LangChain-OpenTutorial/blob/main/19-Cookbook/01-SQL/02-SpeechToSQL.ipynb)


## Overview 

`Speech to SQL` is a voice processing tool that connects audio input processing to SQL query generation. It records spoken language, transcribes it to text, and turns that text into a structured SQL query.

**Key Features** :

- Comprehensive Speech Analysis:
  Captures and processes audio input, supporting various audio formats and microphone configurations.

- Automated Language Processing:
  Detects the spoken language from the input and converts the speech to text.

- Optional Language Support:
  Includes multi-language recognition for handling various speech inputs. The language mode supports:

  `Korean`: Primary support for Korean language queries.

  `English`: Handles English language inputs.

  An unsupported language raises an error.

By preserving the semantic relationships between spoken words, `Speech to SQL` aims to generate precise, context-aware queries.

**Technical Stack** :
The system uses Whisper (through `faster-whisper`) for speech recognition, combined with natural language processing for SQL generation. It supports various query types and audio formats, real-time processing, and error handling for both development and production use. Audio capture and processing rely on the Python libraries `sounddevice` and `numpy`.

### Table of Contents 

- [Overview](#overview)
- [Key Components of Speech to SQL](#key-components-of-speech-to-sql)
- [Speech to SQL Key Parameters](#speech-to-sql-key-parameters)
- [Environment Setup](#environment-setup)
- [Audio Recording and Processing](#audio-recording-and-processing)
- [Usage Example](#usage-example)

## Key Components of Speech to SQL

**Core Processing Components** :
1. `AudioRecorder` → `AudioProcessor`

   `AudioRecorder` handles real-time audio capture. `AudioProcessor` adds noise reduction, sample rate optimization, and audio normalization for better speech recognition.

2. `WhisperTranscriber` → `SpeechRecognizer`

   `SpeechRecognizer` extends basic Whisper transcription with multi-language support, context awareness, and specialized SQL vocabulary recognition.

3. `QueryGenerator` → `SQLFormatter`

   `QueryGenerator` focuses on basic SQL syntax. `SQLFormatter` adds query optimization, validation, and support for complex database operations.
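
The validation role described for `SQLFormatter` is not implemented in this notebook. As a minimal sketch of what such a check could look like, the hypothetical helper `validate_select` below compiles a candidate query against an in-memory SQLite copy of the schema without running it. The function name, the SQLite dialect, and the sample `customers` table are assumptions made for illustration only.

```python
import sqlite3


def validate_select(sql, schema_ddl):
    """Return (ok, message) for a candidate query (hypothetical helper).

    The query is only compiled with EXPLAIN QUERY PLAN against an in-memory
    database built from schema_ddl; it is never executed against real data.
    """
    statement = sql.strip().rstrip(";")
    if not statement.lower().startswith("select"):
        return False, "only SELECT statements are accepted"
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(schema_ddl)
        conn.execute(f"EXPLAIN QUERY PLAN {statement}")
        return True, "ok"
    except (sqlite3.Error, sqlite3.Warning) as exc:
        # Unknown tables or columns, syntax errors, and multi-statement
        # input all end up here.
        return False, str(exc)
    finally:
        conn.close()


# Assumed sample schema, not part of the tutorial.
schema = "CREATE TABLE customers (id INTEGER, name TEXT, overdue_amount REAL);"
print(validate_select("SELECT name FROM customers WHERE overdue_amount > 0;", schema))
print(validate_select("SELECT nme FROM customers", schema))
```

Compiling the statement catches misspelled columns that speech recognition errors can introduce, and restricting input to a single `SELECT` keeps a spoken query from modifying data.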

## Speech to SQL Key Parameters

- `device_id` : Audio input device ID to be used [default: system default microphone]
- `sample_rate` : Audio sampling rate [default: 16000 Hz]
- `model_size` : Whisper model size [default: 'large', options: 'tiny', 'base', 'small', 'medium', 'large']
- `language` : Speech recognition language mode [default: 'en', options: 'auto' (detect automatically), 'en' (English-only)]
- `output_format` : Format of the SQL query output [default: 'standard', options: 'standard', 'formatted', 'minified']
- `min_record_time` : Minimum recording duration in seconds [default: 2.0]
- `vad_parameters` : Voice Activity Detection settings [options: 'min_silence_duration_ms', 'speech_pad_ms', 'threshold']

These parameters can be configured when initializing the Speech to SQL system to customize its behavior for your specific needs. The system is optimized for English language queries and standard SQL output by default.
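
The notebook does not define a single configuration object for these parameters. One way to group and validate them, as a sketch only, is a small dataclass. The class name `SpeechToSQLConfig` is hypothetical, the defaults mirror the list above, and the `vad_parameters` defaults are assumed from the values used later in `transcribe_audio`.

```python
from dataclasses import dataclass, field
from typing import Optional

VALID_MODEL_SIZES = ("tiny", "base", "small", "medium", "large")
VALID_LANGUAGES = ("auto", "en")
VALID_OUTPUT_FORMATS = ("standard", "formatted", "minified")


@dataclass
class SpeechToSQLConfig:
    """Hypothetical container for the parameters listed above."""

    device_id: Optional[int] = None  # None means the system default microphone
    sample_rate: int = 16000
    model_size: str = "large"
    language: str = "en"
    output_format: str = "standard"
    min_record_time: float = 2.0
    vad_parameters: dict = field(
        default_factory=lambda: {"min_silence_duration_ms": 500, "speech_pad_ms": 200}
    )

    def __post_init__(self):
        # Reject values outside the documented option lists early.
        if self.model_size not in VALID_MODEL_SIZES:
            raise ValueError(f"model_size must be one of {VALID_MODEL_SIZES}")
        if self.language not in VALID_LANGUAGES:
            raise ValueError(f"language must be one of {VALID_LANGUAGES}")
        if self.output_format not in VALID_OUTPUT_FORMATS:
            raise ValueError(f"output_format must be one of {VALID_OUTPUT_FORMATS}")
        if self.sample_rate <= 0 or self.min_record_time < 0:
            raise ValueError("sample_rate must be positive and min_record_time non-negative")


config = SpeechToSQLConfig(model_size="small", device_id=12)
print(config)
```

Validating in `__post_init__` surfaces a mistyped option when the configuration is created rather than when the Whisper model is loaded.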

### References
- [Faster Whisper Documentation > Python API Reference](https://github.com/guillaumekln/faster-whisper)
- [SoundDevice Documentation > Python API Reference](https://python-sounddevice.readthedocs.io/en/0.4.6/)
- [Wavio Documentation > Audio File Handling](https://github.com/WarrenWeckesser/wavio)
- [NumPy Documentation > Routines](https://numpy.org/doc/stable/reference/routines.html)

----

## Environment Setup
Set up the environment for the Speech to SQL system. This guide will help you configure all necessary components.

**[Note]** 

- This tutorial requires Python 3.8 or higher for optimal compatibility with audio processing libraries.
- Make sure you have a working microphone connected to your system.
- A CUDA-capable GPU is recommended for faster speech recognition.


### Package Installation
First, install the required packages:

```bash
pip install python-dotenv sounddevice numpy wavio faster-whisper requests torch
```

### System Configuration
To use Speech to SQL, you need to configure the following components:

1. **Audio Setup**
   - Check your audio input devices:
   ```python
   import sounddevice as sd
   print(sd.query_devices())
   ```

2. **Whisper Model Setup**
   - The system will automatically download the required model files on first use
   - Default model is 'large' for better accuracy
   - You can switch to 'small' or 'medium' for faster processing

3. **SQL Backend Configuration**
   - Default port for local SQL service is 11434
   - Ensure your database server is running and accessible
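
Before recording anything, it can help to confirm that the backend address accepts connections. The sketch below only opens and closes a TCP connection; it assumes nothing about the protocol the service speaks, and the helper name `is_backend_reachable` is hypothetical.

```python
import socket
from urllib.parse import urlparse


def is_backend_reachable(url, timeout=1.0):
    """Return True if a TCP connection to the URL's host and port succeeds."""
    parsed = urlparse(url)
    host = parsed.hostname or "localhost"
    # Fall back to the scheme's usual port when the URL does not name one.
    port = parsed.port or (443 if parsed.scheme == "https" else 80)
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolvable hosts.
        return False


print(is_backend_reachable("http://localhost:11434"))
```

A `False` result means nothing is listening on that port yet, which is cheaper to discover here than after a recording has been made.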


In [9]:
%%capture --no-stderr
# Install required packages
%pip install python-dotenv sounddevice numpy wavio faster-whisper requests torch



In [21]:
# Set environment variables
from dotenv import load_dotenv
import os

# Load environment variables
load_dotenv(override=True)

# Check audio devices
import sounddevice as sd
print("Available audio devices:")
print(sd.query_devices())

Available audio devices:
   0 Microsoft 사운드 매퍼 - Output, MME (0 in, 2 out)
<  1 SPDIF 인터페이스(4- Realtek USB2.0 A, MME (0 in, 8 out)
   2 ULTRON 3278(2- NVIDIA High Defi, MME (0 in, 2 out)
   3 주 사운드 드라이버, Windows DirectSound (0 in, 2 out)
   4 SPDIF 인터페이스(4- Realtek USB2.0 Audio), Windows DirectSound (0 in, 8 out)
   5 ULTRON 3278(2- NVIDIA High Definition Audio), Windows DirectSound (0 in, 2 out)
   6 SPDIF 인터페이스(4- Realtek USB2.0 Audio), Windows WASAPI (0 in, 2 out)
   7 ULTRON 3278(2- NVIDIA High Definition Audio), Windows WASAPI (0 in, 2 out)
   8 머리에 거는 수화기 (@System32\drivers\bthhfenum.sys,#2;%1 Hands-Free%0
;(AirPods)), Windows WDM-KS (0 in, 1 out)
   9 머리에 거는 수화기 (@System32\drivers\bthhfenum.sys,#2;%1 Hands-Free%0
;(AirPods)), Windows WDM-KS (1 in, 0 out)
  10 Headphones (Realtek USB2.0 Audio), Windows WDM-KS (0 in, 2 out)
  11 Output (NVIDIA High Definition Audio), Windows WDM-KS (0 in, 2 out)
  12 라인 (Realtek USB2.0 Audio), Windows WDM-KS (2 in, 0 out)
  13 헤드셋 마이크 (Realtek USB

In [22]:
# Set environment variables
import os

# Configure audio and model settings
audio_config = {
    "WHISPER_MODEL": "large",
    "DEFAULT_DEVICE": 12,  # Line (Realtek USB2.0 Audio)
    "SAMPLE_RATE": 16000,
    "SQL_BACKEND_URL": "http://localhost:11434"
}

# Set environment variables
for key, value in audio_config.items():
    os.environ[key] = str(value)

print("Environment variables have been set successfully:")
for key in audio_config:
    print(f"{key}: {os.environ.get(key)}")

Environment variables have been set successfully:
WHISPER_MODEL: large
DEFAULT_DEVICE: 12
SAMPLE_RATE: 16000
SQL_BACKEND_URL: http://localhost:11434
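
Because the settings are stored as strings in `os.environ`, later cells convert them back with `int(...)`. If a variable is missing, `int(os.environ.get(...))` fails with a `TypeError` that does not name the variable. A small reader such as the hypothetical `read_int_env` below is one way to get a default or a clearer error instead. This is a sketch, not part of the original notebook.

```python
import os


def read_int_env(name, default=None, env=None):
    """Read an integer setting from the environment (hypothetical helper).

    Returns `default` when the variable is unset or blank, and raises a
    ValueError that names the variable when the value is not an integer.
    """
    env = os.environ if env is None else env
    raw = env.get(name)
    if raw is None or raw.strip() == "":
        return default
    try:
        return int(raw)
    except ValueError:
        raise ValueError(f"{name} must be an integer, got {raw!r}") from None


sample_rate = read_int_env("SAMPLE_RATE", default=16000)
device_id = read_int_env("DEFAULT_DEVICE")  # None means "use the system default"
print(sample_rate, device_id)
```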


In [23]:
# Initialize Whisper model and AudioRecorder
from faster_whisper import WhisperModel
import numpy as np
import sounddevice as sd
import time
import torch

class AudioRecorder:
    def __init__(self):
        self.recording = False
        self.audio_data = []
        self.stream = None
        self._samplerate = int(os.environ.get('SAMPLE_RATE', 16000))
        self._min_record_time = 2.0  # Minimum recording time (seconds)
        
    def start_recording(self):
        try:
            if self.stream is not None:
                self.stream.stop()
                self.stream.close()
                self.stream = None
                
            self.audio_data = []
            self.recording = True
            self.start_time = time.time()
            
            def audio_callback(indata, frames, time_info, status):
                if status:
                    print(f"Status: {status}")
                if self.recording:
                    self.audio_data.append(indata.copy())
            
            self.stream = sd.InputStream(
                device=int(os.environ.get('DEFAULT_DEVICE')),
                channels=1,
                samplerate=self._samplerate,
                callback=audio_callback,
                blocksize=4096
            )
            self.stream.start()
            print("Recording started...")
            
        except Exception as e:
            print(f"Error in start_recording: {str(e)}")
            self.recording = False
            raise
            
    def stop_recording(self):
        try:
            elapsed_time = time.time() - self.start_time
            if elapsed_time < self._min_record_time:
                time.sleep(self._min_record_time - elapsed_time)
            
            self.recording = False
            if self.stream:
                self.stream.stop()
                self.stream.close()
                self.stream = None
                
            if not self.audio_data:
                return None, None
                
            audio = np.concatenate(self.audio_data, axis=0)
            audio = self._process_audio(audio)
            
            return audio, self._samplerate
            
        except Exception as e:
            print(f"Error processing audio data: {str(e)}")
            return None, None
    
    def _process_audio(self, audio_data):
        audio_data = audio_data - np.mean(audio_data)
        if np.max(np.abs(audio_data)) > 0:
            audio_data = audio_data / np.max(np.abs(audio_data))
        audio_data = audio_data * 0.9
        return audio_data

# Initialize Whisper model
model = WhisperModel(
    os.environ.get('WHISPER_MODEL'),
    device="cuda" if torch.cuda.is_available() else "cpu",
    compute_type="float32"
)

print("Whisper model initialized successfully")
print(f"Using device: {'CUDA' if torch.cuda.is_available() else 'CPU'}")

Whisper model initialized successfully
Using device: CPU


In [25]:
import wavio

def save_audio(audio_data, sample_rate):
    if audio_data is not None:
        temp_dir = os.path.join(os.environ.get('TEMP', '/tmp'), 'speech_to_sql')
        os.makedirs(temp_dir, exist_ok=True)
        
        temp_file = os.path.join(temp_dir, f"audio_{int(time.time())}.wav")
        wavio.write(temp_file, audio_data, sample_rate, sampwidth=3)
        return temp_file
    return None

def transcribe_audio(audio_file):
    try:
        segments, info = model.transcribe(
            str(audio_file),
            language="en",
            beam_size=5,
            vad_filter=True,
            vad_parameters=dict(
                min_silence_duration_ms=500,
                speech_pad_ms=200
            ),
            temperature=0.0,
            initial_prompt="""This is a voice query for SQL database. Examples:
            - Show me the names and overdue amounts of customers with late payments
            - Display card usage statistics for the Seoul branch
            - Retrieve transaction history for VIP customers"""
        )
        
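        # faster-whisper returns the segments as a lazy generator, so it cannot
        # be indexed; iterating runs the transcription and yields each segment.
        text = " ".join(segment.text.strip() for segment in segments)
        return text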
        
    except Exception as e:
        print(f"Transcription error: {str(e)}")
        return None

# Create recorder instance
recorder = AudioRecorder()
print("Audio recorder initialized. Ready for recording.")

Audio recorder initialized. Ready for recording.
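
`save_audio` relies on `wavio` to write 24-bit files. Where `wavio` is not available, the standard library `wave` module can write 16-bit PCM instead. The sketch below assumes mono float samples in the range [-1.0, 1.0], which matches the recorder's normalized output; the function name `write_pcm16_wav` is hypothetical and the change in bit depth is a deliberate simplification.

```python
import os
import struct
import tempfile
import wave


def write_pcm16_wav(path, samples, sample_rate=16000):
    """Write mono float samples in [-1.0, 1.0] as a 16-bit PCM WAV file."""
    frames = bytearray()
    for sample in samples:
        clipped = max(-1.0, min(1.0, float(sample)))  # guard against overshoot
        frames += struct.pack("<h", int(round(clipped * 32767)))
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)   # mono, matching channels=1 in the recorder
        wav_file.setsampwidth(2)   # 2 bytes per sample = 16-bit PCM
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(bytes(frames))
    return path


# Usage: write half a second of silence and read the header back.
demo_path = os.path.join(tempfile.gettempdir(), "speech_to_sql_demo.wav")
write_pcm16_wav(demo_path, [0.0] * 8000)
with wave.open(demo_path, "rb") as wav_file:
    print(wav_file.getframerate(), wav_file.getnframes())
```

With a NumPy array from the recorder, the samples could be passed as `audio.flatten().tolist()`.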


The system is now ready with the following components initialized:
- **Audio Recorder**: Set up with the specified input device (Realtek USB2.0 Audio) and configured for 16kHz sampling rate
- **Whisper Model**: Running in CPU mode with the 'large' model for optimal accuracy
- **Audio Processing**: Configured with:
  - Minimum recording time: 2 seconds
  - DC offset removal and peak normalization (scaled to 90% of full range)
  - 24-bit audio quality for saving files
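
The processing step in `AudioRecorder._process_audio` is short enough to restate on its own. The sketch below applies the same three operations (subtract the mean, divide by the peak, scale by 0.9) to a plain Python list so that it runs without NumPy. The function name `normalize_peak` and the `headroom` parameter are names chosen for this illustration.

```python
def normalize_peak(samples, headroom=0.9):
    """Remove the DC offset, then scale so the largest magnitude equals `headroom`."""
    if not samples:
        return []
    mean = sum(samples) / len(samples)
    centered = [s - mean for s in samples]   # DC offset removal
    peak = max(abs(s) for s in centered)
    if peak == 0:
        return centered                      # silence stays silent, avoids 0/0
    return [s / peak * headroom for s in centered]


print(normalize_peak([1.0, 3.0]))   # [-0.9, 0.9]
print(normalize_peak([0.5, 0.5]))   # [0.0, 0.0]
```

Scaling to 0.9 rather than 1.0 leaves a little headroom so that the samples do not clip when they are written to a file. Note that this step changes level only; it does not filter out background noise.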

You can now use this setup to:
1. Record voice input using the configured microphone
2. Convert speech to text using the Whisper model
3. Process the text into SQL queries

The next step verifies the audio input device, after which the recording and transcription function is defined.

In [26]:
print("\n=== Audio Device Configuration ===")
print("Before proceeding with speech to SQL conversion, let's verify your audio input device.")

# Check available input devices
def list_input_devices():
    print("\nAvailable Input Devices:")
    print("-" * 50)
    for idx, device in enumerate(sd.query_devices()):
        if device['max_input_channels'] > 0:
            print(f"Device {idx}: {device['name']}")
            print(f"  Input channels: {device['max_input_channels']}")
            print(f"  Sample Rate: {device['default_samplerate']}Hz")
            print("-" * 50)

# Display available devices and guide user
list_input_devices()
print("\nImportant:")
print("1. Note down the device number you want to use from the list above")
print("2. Choose a device with input channels > 0")
print("3. You'll need to set this device number in the next step")

print("\nExample: To use the line input device (Realtek USB2.0 Audio):")
print("device_id = 12  # Adjust this number based on your system")

# Set the audio device
device_id = 12  # Default value - users should change this
os.environ['DEFAULT_DEVICE'] = str(device_id)
print(f"\nCurrent selected device ID: {device_id}")

print("\n=== Ready for Speech to SQL Conversion ===")
print("Example queries you can try:")
print("1. Show me the sales data for the Seoul branch from last month")
print("2. List all customers who spent more than 1000 dollars in 2023")
print("3. Find the top 5 products by revenue in the electronics category")
print("4. Get the total number of transactions by each branch")
print("5. Display all orders with delayed shipping status")

print("\nWhen ready, run:")
print("query_text = process_speech_to_sql()")


=== Audio Device Configuration ===
Before proceeding with speech to SQL conversion, let's verify your audio input device.

Available Input Devices:
--------------------------------------------------
Device 9: 머리에 거는 수화기 (@System32\drivers\bthhfenum.sys,#2;%1 Hands-Free%0
;(AirPods))
  Input channels: 1
  Sample Rate: 16000.0Hz
--------------------------------------------------
Device 12: 라인 (Realtek USB2.0 Audio)
  Input channels: 2
  Sample Rate: 48000.0Hz
--------------------------------------------------
Device 13: 헤드셋 마이크 (Realtek USB2.0 Audio)
  Input channels: 2
  Sample Rate: 48000.0Hz
--------------------------------------------------
Device 14: 데스크톱 마이크 (Realtek USB2.0 Audio)
  Input channels: 2
  Sample Rate: 48000.0Hz
--------------------------------------------------

Important:
1. Note down the device number you want to use from the list above
2. Choose a device with input channels > 0
3. You'll need to set this device number in the next step

Example: To use the line inp

The next cell implements voice recording and transcription. SQL generation is left as a placeholder inside the function.

In [29]:
def process_speech_to_sql():
    try:
        print("\n=== Starting Speech to SQL Process ===")
        
        # Countdown before recording
        print("Recording will start in:")
        for i in range(3, 0, -1):
            print(f"{i}...")
            time.sleep(1)
        
        print("\nRecording... Speak your query now (5 seconds)")
        recorder.start_recording()
        time.sleep(5)  # Record for 5 seconds
        
        print("\nProcessing your recording...")
        audio_data, sample_rate = recorder.stop_recording()
        
        if audio_data is not None:
            # Save audio to file
            print("Converting speech to text...")
            audio_file = save_audio(audio_data, sample_rate)
            
            if audio_file:
                # Convert speech to text
                text = transcribe_audio(audio_file)
                
                if text:
                    print("\nTranscribed text:")
                    print(f'"{text}"')
                    
                    print("\nGenerating SQL query...")
                    # SQL conversion will be implemented in the next step
                    return text
                else:
                    print("No speech detected in the recording.")
            else:
                print("Failed to save audio file.")
        else:
            print("No audio data was recorded.")
            
    except Exception as e:
        print(f"\nError: {str(e)}")
        print("Please verify your selected audio device and try again.")
    
    return None

print("\nTo start the conversion process, execute:")
print("query_text = process_speech_to_sql()")


To start the conversion process, execute:
query_text = process_speech_to_sql()
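
`process_speech_to_sql` stops at the transcribed text, and the notebook does not specify how the SQL backend is called. As one possible next step, the transcription could be wrapped in a text-to-SQL prompt before being sent to whatever service generates the query. The helper `build_sql_prompt`, the prompt wording, and the sample schema below are all assumptions for illustration.

```python
def build_sql_prompt(question, schema_ddl, dialect="SQLite"):
    """Compose a text-to-SQL prompt from a transcribed question and a schema."""
    question = " ".join(question.split())  # collapse whitespace left by transcription
    if not question:
        raise ValueError("question is empty; nothing to convert")
    return (
        f"You are an assistant that writes {dialect} queries.\n"
        f"Schema:\n{schema_ddl.strip()}\n\n"
        f"Question: {question}\n"
        "Answer with a single SELECT statement and nothing else."
    )


# Assumed sample schema, not part of the tutorial.
schema = "CREATE TABLE sales (branch TEXT, amount REAL, sold_at TEXT);"
print(build_sql_prompt("Show me the sales data for the Seoul branch from last month", schema))
```

Including the schema in the prompt gives the generator the table and column names it needs, which the spoken question alone does not carry.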


## Usage Example
Let's try running a complete example of Speech to SQL conversion.

### Setup Preparation

In this tutorial, we will use the following components:

- Audio Input Device: Realtek USB2.0 Audio (or your system's microphone)
- Speech Recognition: Whisper Large Model
- Sample Rate: 16kHz
- Recording Duration: 5 seconds

First, make sure you have properly installed all required packages and configured your audio device as shown in the previous steps.

### Basic Example

Let's try a simple SQL query conversion:

```python
# Configure audio device
device_id = 12  # Change this to match your system's input device
os.environ['DEFAULT_DEVICE'] = str(device_id)

# Start the conversion process
query_text = process_speech_to_sql()
```

When you run this code:
1. You'll see a 3-second countdown
2. Speak your query clearly (e.g., "Show me the sales data for the Seoul branch from last month")
3. The system will process your speech and display the transcribed text
4. The function returns the transcribed text; converting that text to a SQL query is left as a placeholder in this notebook

### Example Output:
```
=== Starting Speech to SQL Process ===
Recording will start in:
3...
2...
1...

Recording... Speak your query now (5 seconds)

Processing your recording...
Converting speech to text...

Transcribed text:
"Show me the sales data for the Seoul branch from last month"

Generating SQL query...
```

In [30]:
# Run Speech to SQL
query_text = process_speech_to_sql()

# Print the result
print("Generated SQL Query:")
print(query_text)


=== Starting Speech to SQL Process ===
Recording will start in:
3...
2...
1...

Recording... Speak your query now (5 seconds)
Error in start_recording: Error opening InputStream: Invalid device [PaErrorCode -9996]

Error: Error opening InputStream: Invalid device [PaErrorCode -9996]
Please verify your selected audio device and try again.
Generated SQL Query:
None
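
The run above ends with `Invalid device [PaErrorCode -9996]`. One plausible cause is that the hard-coded index `12` does not point at a usable input device on the current system, since device indices can change between machines and sessions. A sketch of a more portable approach is to look the device up by name. The helper `find_input_device` is hypothetical; it expects a list of dictionaries with the `name` and `max_input_channels` keys that `sd.query_devices()` entries provide.

```python
def find_input_device(devices, name_fragment):
    """Return the index of the first input-capable device whose name contains
    `name_fragment` (case-insensitive), or None if there is no match."""
    fragment = name_fragment.lower()
    for index, device in enumerate(devices):
        if device["max_input_channels"] > 0 and fragment in device["name"].lower():
            return index
    return None


# Stand-in for list(sd.query_devices()) so the example runs without a sound card.
sample_devices = [
    {"name": "Microsoft Sound Mapper - Output", "max_input_channels": 0},
    {"name": "Line (Realtek USB2.0 Audio)", "max_input_channels": 2},
    {"name": "Headset Microphone (Realtek USB2.0 Audio)", "max_input_channels": 2},
]
print(find_input_device(sample_devices, "realtek"))   # 1
print(find_input_device(sample_devices, "airpods"))   # None
```

With a real device list, the returned index could be written to `os.environ['DEFAULT_DEVICE']` before calling `process_speech_to_sql()`, and a `None` result handled by asking the user to pick a device.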
