# Audio Recording, Diarization, and Transcription System

This project records audio, performs speaker diarization, and transcribes the recorded speech using **PyAudio**, **Wave**, **Vosk**, **pyannote.audio**, **Whisper**, and **Pydub**. The sections below document how the system works, the libraries it uses, and the models it requires.

---

## **Features**
1. **Audio Recording**  
   Records live audio input from the microphone for a specified duration.
   
2. **Speaker Diarization**  
   Identifies and segments speech from different speakers in the recorded audio.

3. **Speech Transcription**  
   Converts the segmented audio from speaker diarization into text.
   
---

## **Libraries Used**

### **1. PyAudio**
   - **Purpose**: Handles live audio recording from the microphone.
   - **Installation**:  
     ```bash
     pip install pyaudio
     ```
   - **Documentation**: [PyAudio Documentation](https://people.csail.mit.edu/hubert/pyaudio/)

### **2. Wave**
   - **Purpose**: Reads and writes WAV audio files.
   - **Installation**: Built into Python; no separate installation is required.
   - **Documentation**: [Wave Documentation](https://docs.python.org/3/library/wave.html)

### **3. Pydub**
   - **Purpose**: Manages audio processing tasks such as audio segmentation.
   - **Installation**:  
     ```bash
     pip install pydub
     ```

   - **Documentation**: [Pydub Documentation](https://github.com/jiaaro/pydub)

### **4. Vosk**
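   - **Purpose**: Splits the recording into timed segments using a pre-trained Vosk model. In the code below, this path labels every segment `Speaker_1`; the separation into different speakers comes from the pyannote path.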
   - **Installation**:  
     ```bash
     pip install vosk
     ```
   - **Documentation**: [Vosk API Documentation](https://alphacephei.com/vosk/)

### **5. Whisper**
   - **Purpose**: Performs automatic speech recognition (ASR) for transcription.
   - **Installation**:  
     ```bash
     pip install openai-whisper
     ```
   - **Dependencies**: Requires PyTorch. Install PyTorch as per your system configuration:  
     ```bash
     pip install torch torchvision torchaudio
     ```
   - **Documentation**: [Whisper GitHub Repository](https://github.com/openai/whisper)

### **6. pyannote.audio**
   - **Purpose**: Performs speaker diarization through the pretrained `pyannote/speaker-diarization-3.1` pipeline, which the code loads with a Hugging Face access token.
   - **Installation**: `pip install pyannote.audio`
   - **Documentation**: [pyannote.audio GitHub Repository](https://github.com/pyannote/pyannote-audio)

---

## **Models Used**

### **1. Whisper ASR Model**
   - **Source**: OpenAI's Whisper
   - **Purpose**: Converts audio to text with high accuracy.
   - **Model Variant Used**: `base`
   - **Documentation**: [Whisper Models](https://github.com/openai/whisper#available-models)

   ```python
   model = whisper.load_model("base")
   ```

### **2. Vosk Speech Recognition Model**
   - **Source**: Vosk
   - **Purpose**: Recognizes speech in the recording; the code uses the points where recognition results are returned to cut the audio into segments.
   - **Model Used**: `vosk-model-small-en-us-0.15`
   - **Download**: [Vosk Models](https://alphacephei.com/vosk/models)
   - **Model Path Example**: `"vosk-model-small-en-us-0.15"`

---

## **Code Walkthrough**

### **1. Recording Audio**
The `record_audio` function captures audio from the system microphone and saves it as a WAV file.

#### **Parameters**
- `output_filename` (str): Path to save the recorded audio file.
- `duration` (int): Duration of the recording in seconds.

#### **Key Functions**
- `p.open`: Opens the audio stream for recording.
- `stream.read`: Reads audio chunks and saves them to a list.
- `wave.open`: Saves the recorded audio in WAV format.

#### Example Call
```python
audio_filename = "live_audio.wav"
record_audio(audio_filename, duration=10)
```
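
As a quick check on recording length, the sketch below works out how many chunks the recording loop reads and how much audio that covers. It assumes the 16 kHz sample rate and 1024-frame chunk size used by this project's `record_audio`; `planned_reads` is a hypothetical helper name and is not part of the project.

```python
def planned_reads(duration, fs=16000, chunk=1024):
    """Return (number of stream.read calls, seconds of audio actually captured)."""
    # Same expression as the range() bound in the recording loop.
    n_reads = int(fs / chunk * duration)
    captured_seconds = n_reads * chunk / fs
    return n_reads, captured_seconds


n_reads, captured = planned_reads(10)
print(f"{n_reads} reads cover {captured:.3f} s of audio")
```

Because the read count is truncated to a whole number of chunks, the saved file can be slightly shorter than the requested duration.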

---

### **2. Diarizing Audio**
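The `diarize_with_vosk` function splits the audio into timed segments using Vosk. The alternative `diarize_with_pyannote` function runs the pyannote pipeline and returns its diarization result with per-speaker labels.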

#### **Parameters**
- `audio_file_path` (str): Path to the audio file to process.
- `model_path` (str): Path to the Vosk model directory.

#### **Key Functions**
- `KaldiRecognizer`: Processes the WAV file chunk by chunk and returns recognized text at utterance boundaries.
- Outputs a list of tuples in the format `(start_time, end_time, speaker)`. On the Vosk path the speaker label is always `Speaker_1`.
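
One way to derive a segment tuple is from the word timestamps in a Vosk result rather than from the file position. The sketch below is an alternative to what the project code does, and it assumes the word-level JSON layout that Vosk returns when `SetWords(True)` is enabled (a `result` list whose entries carry `start`, `end`, and `word`). The helper name `segment_from_vosk_result` and the sample payload are hypothetical.

```python
import json


def segment_from_vosk_result(result_json, speaker="Speaker_1"):
    """Turn one Vosk result (JSON string) into (start_time, end_time, speaker).

    Returns None when the result contains no recognized words.
    """
    result = json.loads(result_json)
    words = result.get("result", [])
    if not result.get("text") or not words:
        return None
    # The first word's start and the last word's end bound the spoken part.
    return (words[0]["start"], words[-1]["end"], speaker)


# Hypothetical payload in the assumed layout, for illustration only.
sample = json.dumps({
    "result": [
        {"conf": 1.0, "start": 0.87, "end": 1.11, "word": "hello"},
        {"conf": 1.0, "start": 1.11, "end": 1.53, "word": "there"},
    ],
    "text": "hello there",
})
print(segment_from_vosk_result(sample))
```

Bounding a segment by its words leaves out leading and trailing silence, which can shorten the clips passed on to transcription.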

---

### **3. Transcribing Audio**
The `transcribe_audio` function uses the Whisper model to transcribe audio into text.

#### **Parameters**
- `segment_path` (str): Path to the audio segment file for transcription.

#### **Key Functions**
- `model.transcribe`: Converts the audio segment into text.

---

### **4. Processing Audio**
The `process_audio_with_segments` function combines segment extraction, playback, and transcription.

#### **Steps**
1. Receive the speaker segments produced by `diarize_with_vosk` or `diarize_with_pyannote`.
2. Extract each audio segment using Pydub and play it back.
3. Transcribe each segment using Whisper.
4. Return the dialogue as text, one line per segment, prefixed with the speaker name.
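
Before slicing, it can help to tidy the segment list. The sketch below merges consecutive turns from the same speaker and converts seconds to the millisecond offsets that Pydub slicing expects. The project code does not merge turns; `merge_turns`, `to_milliseconds`, and the `max_gap` threshold are hypothetical names and values chosen for illustration. The sample segments are the first three from the run shown later in this document.

```python
def merge_turns(segments, max_gap=1.0):
    """Merge consecutive segments from one speaker separated by at most max_gap seconds."""
    merged = []
    for start, end, speaker in segments:
        if merged and merged[-1][2] == speaker and start - merged[-1][1] <= max_gap:
            # Extend the previous turn instead of starting a new one.
            merged[-1] = (merged[-1][0], end, speaker)
        else:
            merged.append((start, end, speaker))
    return merged


def to_milliseconds(segments):
    """Convert (start_s, end_s, speaker) to integer millisecond offsets."""
    # round() avoids off-by-one results from floating-point truncation.
    return [(round(s * 1000), round(e * 1000), spk) for s, e, spk in segments]


segments = [
    (0.03, 0.52, "SPEAKER_01"),
    (1.04, 3.56, "SPEAKER_01"),
    (3.86, 6.83, "SPEAKER_00"),
]
print(to_milliseconds(merge_turns(segments)))
```

Merging very short turns into their neighbors gives the transcriber longer clips to work with, at the cost of including the pause between them.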

---

## **Example Workflow**

```python
audio_filename = "live_audio.wav"
record_audio(audio_filename, duration=50)
segments = diarize_with_vosk(audio_filename, model_path="vosk-model-small-en-us-0.15")
print(process_audio_with_segments(audio_filename, segments))
```

---

## **Outputs**
1. **Segmented Audio Playback**  
   Plays each audio segment corresponding to a specific speaker.
   
2. **Transcription**  
   Prints the transcription of audio for each speaker. Example:
   ```
   Person1: Hello, how are you?
   Person2: I'm good, thank you.
   ```
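
The printed format above can be produced from a list of `(speaker, text)` pairs. The sketch below is one possible way to do it, not the project's own code: `format_dialogue` is a hypothetical helper that applies a speaker mapping and, unlike the project code, skips segments whose transcription came back empty.

```python
def format_dialogue(turns, speaker_mapping=None):
    """Render (speaker_label, text) pairs as 'Name: text' lines."""
    speaker_mapping = speaker_mapping or {}
    lines = []
    for speaker, text in turns:
        text = text.strip()
        if not text:
            # Very short segments can transcribe to nothing; leave them out.
            continue
        name = speaker_mapping.get(speaker, speaker)
        lines.append(f"{name}: {text}")
    return "\n".join(lines)


turns = [
    ("SPEAKER_01", "  "),
    ("SPEAKER_01", " Hello, how are you?"),
    ("SPEAKER_00", " I'm good, thank you."),
]
mapping = {"SPEAKER_00": "Person2", "SPEAKER_01": "Person1"}
print(format_dialogue(turns, mapping))
```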


```python
import pyaudio
import wave
import os
import json
from pydub import AudioSegment
from pydub.playback import play
from vosk import Model, KaldiRecognizer
from pyannote.audio import Pipeline
import whisper


# Load both models once; loading them inside each call would be slow.
whisper_model = whisper.load_model("base")
pyannote_pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token="your_huggingface_token"
)

def record_audio(output_filename, duration=10):
    """
    Records audio from the microphone and saves it as a .wav file.
    """
    chunk = 1024                      # frames per read
    sample_format = pyaudio.paInt16   # 16-bit samples
    channels = 1                      # mono
    fs = 16000                        # sample rate in Hz

    p = pyaudio.PyAudio()
    sample_width = p.get_sample_size(sample_format)
    print("Recording...")

    stream = p.open(format=sample_format,
                    channels=channels,
                    rate=fs,
                    frames_per_buffer=chunk,
                    input=True)

    frames = []
    for _ in range(0, int(fs / chunk * duration)):
        data = stream.read(chunk)
        frames.append(data)

    stream.stop_stream()
    stream.close()
    p.terminate()

    print("Recording finished")

    # The context manager closes the file even if a write fails.
    with wave.open(output_filename, 'wb') as wf:
        wf.setnchannels(channels)
        wf.setsampwidth(sample_width)
        wf.setframerate(fs)
        wf.writeframes(b''.join(frames))

def diarize_with_vosk(audio_file_path, model_path="path_to_your_vosk_model"):
    """
    Splits audio into timed segments using the Vosk model.

    Every segment is labeled "Speaker_1"; this path does not separate speakers.
    """
    if not os.path.exists(model_path):
        raise ValueError("Model path does not exist. Please provide a valid path to the Vosk model.")

    model = Model(model_path)
    segments = []
    start_time = 0.0

    with wave.open(audio_file_path, "rb") as wf:
        if wf.getnchannels() != 1 or wf.getsampwidth() != 2 or wf.getframerate() != 16000:
            raise ValueError("Audio file must be WAV format mono PCM with 16kHz sample rate.")

        recognizer = KaldiRecognizer(model, wf.getframerate())
        recognizer.SetWords(True)

        while True:
            data = wf.readframes(4000)
            if len(data) == 0:
                break

            # AcceptWaveform returns True when Vosk has finalized an utterance.
            if recognizer.AcceptWaveform(data):
                result = json.loads(recognizer.Result())
                if result.get("text"):
                    end_time = wf.tell() / wf.getframerate()
                    segments.append((start_time, end_time, "Speaker_1"))
                    start_time = end_time

        # Flush the recognizer so speech after the last boundary is not dropped.
        final = json.loads(recognizer.FinalResult())
        if final.get("text"):
            end_time = wf.tell() / wf.getframerate()
            segments.append((start_time, end_time, "Speaker_1"))

    return segments


def diarize_with_pyannote(audio_file_path):
    """
    Diarizes audio using Pyannote.
    """
    diarization = pyannote_pipeline(audio_file_path)
    return diarization

def transcribe_audio(segment_path):
    """
    Transcribes audio using Whisper.
    """
    result = whisper_model.transcribe(segment_path)
    return result["text"]

def process_audio_with_segments(audio_file_path, segments, speaker_mapping=None):
    """
    Processes audio segments for playback and transcription.

    Returns the dialogue as one string, one line per segment.
    """
    audio = AudioSegment.from_file(audio_file_path)
    dialogue = ""

    for i, (start_time, end_time, speaker) in enumerate(segments):
        # Pydub slices by milliseconds.
        start_ms = int(start_time * 1000)
        end_ms = int(end_time * 1000)
        segment_audio = audio[start_ms:end_ms]
        segment_path = f"temp_segment_{speaker}_{i}.wav"
        segment_audio.export(segment_path, format="wav")

        print(f"Playing segment for {speaker} from {start_time:.2f}s to {end_time:.2f}s...")
        play(segment_audio)

        transcribed_text = transcribe_audio(segment_path)
        speaker_name = speaker_mapping.get(speaker, speaker) if speaker_mapping else speaker
        dialogue += f"{speaker_name}: {transcribed_text.strip()}  \n"

        # Remove the temporary clip once it has been transcribed.
        os.remove(segment_path)

    dialogue += "\n"
    return dialogue


if __name__ == "__main__":
    audio_filename = "live_audio2.wav"
    record_audio(audio_filename, duration=50)

    use_vosk = False

    if use_vosk:
        vosk_model_path = "vosk-model-small-en-us-0.15"
        segments = diarize_with_vosk(audio_filename, vosk_model_path)
        speaker_map = None
    else:
        diarization = diarize_with_pyannote(audio_filename)
        # Flatten the pyannote result into (start, end, speaker) tuples.
        segments = [(turn.start, turn.end, speaker) for turn, _, speaker in diarization.itertracks(yield_label=True)]
        speaker_map = {"SPEAKER_00": "Person1", "SPEAKER_01": "Person2"}

    dialogue = process_audio_with_segments(audio_filename, segments, speaker_map)
    print(dialogue)
```


```
Recording...
Recording finished
Playing segment for SPEAKER_01 from 0.03s to 0.52s...
Playing segment for SPEAKER_01 from 1.04s to 3.56s...
Playing segment for SPEAKER_00 from 3.86s to 6.83s...
Playing segment for SPEAKER_01 from 6.83s to 11.46s...
Playing segment for SPEAKER_00 from 11.74s to 17.11s...
Playing segment for SPEAKER_01 from 17.61s to 22.58s...
Playing segment for SPEAKER_00 from 22.93s to 27.66s...
Playing segment for SPEAKER_01 from 27.94s to 32.48s...
Playing segment for SPEAKER_00 from 32.90s to 36.11s...
Person2:
Person2: Hey, have you started preparing for the NLP final?
Person1: Not yet, I have been caught up with the project. What about you?
Person2: Same here, the group project is taking all my time and I haven't even started doing the theory.
Person1: Exactly, the project is related to NLP but the final is more about concepts like embedding and Transformers.
Person2: Yeah, and I barely remember anything about the attention mechanism. I think I need to revise the basics.
Person1: M
```

## Documentation: Loading and Testing a Fine-Tuned T5 Model

Here we load our fine-tuned T5 model and its tokenizer. The `AutoTokenizer` and `AutoModelForSeq2SeqLM` classes load both from a local directory. The model is moved to the GPU if one is available, and the conversation transcribed from audio is passed to it for summarization. The input prompt contains the conversation text together with an instruction to summarize it. The model's output can then be compared with a human-written baseline summary to evaluate its performance.

The key steps are tokenizing the prompt, generating the summary with a limit of 200 new tokens, and decoding the output into readable text. The script prints the input prompt and the model-generated summary; a human baseline summary can be printed alongside them for comparison. This setup suits testing fine-tuned T5 models on dialogue summarization and similar sequence-to-sequence tasks.
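
For a rough numeric comparison between a generated summary and a human baseline, a word-overlap score is one simple option. The sketch below computes a unigram F1 on whitespace-split, lowercased words. It is a crude stand-in and not the evaluation used in this project; `unigram_f1` is a hypothetical helper, and the two example strings are made up for illustration.

```python
from collections import Counter


def unigram_f1(candidate, reference):
    """F1 over the words shared by a candidate summary and a reference summary."""
    cand_words = candidate.lower().split()
    ref_words = reference.lower().split()
    if not cand_words or not ref_words:
        return 0.0
    # Counter intersection counts each shared word at most as often as it appears in both.
    overlap = sum((Counter(cand_words) & Counter(ref_words)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand_words)
    recall = overlap / len(ref_words)
    return 2 * precision * recall / (precision + recall)


# Made-up strings, for illustration only.
model_summary = "two students plan to split the exam topics"
human_summary = "two students agree to split topics and meet tomorrow"
print(f"unigram F1: {unigram_f1(model_summary, human_summary):.2f}")
```

A score like this only measures shared vocabulary, so it is best read next to the two summaries themselves.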

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import torch

model_save_path = "./fine_tunede7_t5_model"

tokenizer_loaded = AutoTokenizer.from_pretrained(model_save_path)
model_loaded = AutoModelForSeq2SeqLM.from_pretrained(model_save_path)

print("Model and tokenizer loaded successfully!")


# `dialogue` is the transcript returned by process_audio_with_segments in the previous cell.
prompt = f"""
Summarize the following conversation.

{dialogue}

Summary:
"""

device = "cuda" if torch.cuda.is_available() else "cpu"
model_loaded = model_loaded.to(device)

inputs = tokenizer_loaded(prompt, return_tensors="pt").to(device)

# Pass the attention mask along with the input IDs; no gradients are needed for inference.
with torch.no_grad():
    generated = model_loaded.generate(**inputs, max_new_tokens=200)

output = tokenizer_loaded.decode(generated[0], skip_special_tokens=True)


dash_line = '-' * 100
print(dash_line)
print(f"INPUT PROMPT:\n{prompt}")
print(dash_line)
print(f"MODEL GENERATION - FINE-TUNED:\n{output}")
```


```
Model and tokenizer loaded successfully!
----------------------------------------------------------------------------------------------------
INPUT PROMPT:

Summarize the following conversation.

Person2:
Person2: Hey, have you started preparing for the NLP final?
Person1: Not yet, I have been caught up with the project. What about you?
Person2: Same here, the group project is taking all my time and I haven't even started doing the theory.
Person1: Exactly, the project is related to NLP but the final is more about concepts like embedding and Transformers.
Person2: Yeah, and I barely remember anything about the attention mechanism. I think I need to revise the basics.
Person1: Me too. How about we split the topics? You focus on embedding and I'll go over transformers.
Person2: That sounds good. Let's also look at the passapres to see what topics usually come up.
Person1: Good idea, we can meet tomorrow to go over everything. We've got this.



Summary:

--------------
```