# Voice Agents

For this project we build a voice agent "the sandwich way": the agent sits between a speech-to-text (STT) step and a text-to-speech (TTS) step, giving a pipeline like:
```
STT -> AGENT -> TTS
```
Another common implementation is to use an all-in-one voice agent, but then we lose control over the individual parts and usually cannot use the most recent LLMs.

On the other hand, "the sandwich" leaves more of the responsibility with us and is less production-ready out of the box: it needs more work.
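
To make the sandwich concrete, here is a minimal sketch of one conversational turn as three swappable stages. The names `run_turn`, `fake_stt`, `fake_agent` and `fake_tts` are hypothetical stand-ins invented for illustration; they are not the project's functions, and real stages would call an STT model, an LLM agent and a TTS model.

```python
from typing import Callable

# Each stage is a plain callable, so any one of them can be swapped independently.
Transcriber = Callable[[bytes], str]
Agent = Callable[[str], str]
Synthesizer = Callable[[str], bytes]


def run_turn(audio_in: bytes, stt: Transcriber, agent: Agent, tts: Synthesizer) -> bytes:
    """Run one turn: audio in -> user text -> reply text -> audio out."""
    user_text = stt(audio_in)
    reply_text = agent(user_text)
    return tts(reply_text)


# Hypothetical stand-ins: bytes are treated as UTF-8 text instead of real audio.
def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_agent(text: str) -> str:
    return f"You said: {text}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


reply_audio = run_turn(b"hello", fake_stt, fake_agent, fake_tts)
print(reply_audio)
```

Because every stage is passed in as an argument, replacing one model (say, the LLM behind the agent) does not touch the other two. That is the control the sandwich approach buys.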

> **NOTE:** One thing that often goes unmentioned when discussing this sandwich architecture is that there is actually an additional step: VAD (Voice Activity Detection), which detects when the user starts speaking and when they are done. We can skip it if we record while a button is pressed, but the conversation then feels less natural.
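
As a rough illustration of what a VAD step does, the sketch below marks the end of a turn once a few consecutive frames fall below an energy threshold. This is a toy under stated assumptions: the function names, the threshold and the frame counts are all made up for the example, and production VADs (including the detection built into hosted models) generally rely on trained models rather than a fixed energy threshold.

```python
import math
from typing import Optional, Sequence


def rms(frame: Sequence[float]) -> float:
    """Root-mean-square energy of one audio frame (samples in [-1, 1])."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def end_of_speech_index(
    frames: Sequence[Sequence[float]],
    threshold: float = 0.02,       # assumed energy threshold
    max_silent_frames: int = 3,    # assumed number of quiet frames that ends a turn
) -> Optional[int]:
    """Return the index of the frame where the turn is considered finished, or None."""
    speaking = False
    silent_run = 0
    for i, frame in enumerate(frames):
        if rms(frame) >= threshold:
            speaking = True        # user is talking: reset the silence counter
            silent_run = 0
        elif speaking:
            silent_run += 1        # silence only counts after speech has started
            if silent_run >= max_silent_frames:
                return i
    return None


loud = [0.5, -0.5] * 80
quiet = [0.0] * 160
frames = [quiet, loud, loud, quiet, quiet, quiet, quiet]
print(end_of_speech_index(frames))
```

Leading silence is ignored, so the turn only ends after speech has been heard and then followed by a sustained pause.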

Reference: https://docs.langchain.com/oss/python/langchain/voice-agent#overview

You can find the full working implementation here: [projects/voice agents](../projects/voice_agents/README.md). We will go over the highlights of the project in this notebook.

## Speech to Text (STT)

We can implement STT in different ways, but mainly we either:
- stream a transcription model
- input an audio file into a transcription model (no streaming)

In [STT](../projects/voice_agents/src/STT/README.md) you can find three STT examples: two use [Deepgram](https://developers.deepgram.com/home), and one uses Whisper from OpenAI.

With Deepgram we can stream audio directly, which reduces latency. Whisper needs a file to transcribe, so we must first save the audio to a file and only then transcribe it, which means higher latency (but also high accuracy).
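
The difference between the two modes can be sketched without any real model. In the example below, `transcribe_chunk` is a hypothetical stand-in (it just upper-cases text that plays the role of audio); neither Deepgram nor Whisper is called. The point is only the control flow: the file-style function returns nothing until the whole recording exists, while the streaming one yields a partial transcript per chunk.

```python
from typing import Iterable, Iterator


def transcribe_chunk(chunk: str) -> str:
    """Hypothetical stand-in for a transcription call; 'audio' is plain text here."""
    return chunk.upper()


def batch_transcribe(chunks: Iterable[str]) -> str:
    """File-style: collect the whole recording first, then transcribe once."""
    recording = " ".join(chunks)  # nothing can be returned before this completes
    return transcribe_chunk(recording)


def stream_transcribe(chunks: Iterable[str]) -> Iterator[str]:
    """Streaming-style: emit a partial transcript as soon as each chunk arrives."""
    for chunk in chunks:
        yield transcribe_chunk(chunk)


audio_chunks = ["find papers", "about voice", "agents"]

# Streaming: the first partial is available after a single chunk.
first_partial = next(stream_transcribe(iter(audio_chunks)))
partials = list(stream_transcribe(audio_chunks))

# Batch: one result, only after all chunks are in.
full = batch_transcribe(audio_chunks)
print(first_partial, "|", full)
```

With streaming, downstream steps can start reacting to `first_partial` while the user is still talking; with the file-based approach the earliest possible reaction is after the recording ends.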

Key differences: 

- Deepgram `Nova3`: streaming transcription model, efficient
- Deepgram `Flux`: transcription model that automatically detects pauses (integrated VAD), which makes it a good fit for voice agents
- OpenAI's `Whisper`: high accuracy, but it does not stream, so latency is higher

For the actual implementation we use the Flux model.

## Text to Speech (TTS)

For the text-to-speech part there are lightweight models that you can run locally on your machine. You do not even need a GPU: they can run on CPU. Keep in mind, though, that on CPU they are slower, so you lose some of the natural, near-zero-latency feel. On the other hand, they are free.

So, we use [Kokoro](https://huggingface.co/hexgrad/Kokoro-82M): check [the TTS README](../projects/voice_agents/src/TTS/README.md) for how to download it from Hugging Face. We run it on CPU with ONNX optimization, but you can modify the code to run it on a GPU if you have one.
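
One way to judge whether CPU synthesis is fast enough for conversation is the real-time factor: synthesis time divided by the duration of the audio produced (below 1 means faster than real time). The sketch below measures it for any synthesizer that returns `(samples, sample_rate)`. That return shape, the 24 kHz rate and the `fake_synthesize` stand-in are assumptions for illustration, not the project's TTS code.

```python
import time
from typing import Callable, Sequence, Tuple

Synthesizer = Callable[[str], Tuple[Sequence[float], int]]


def real_time_factor(synthesize: Synthesizer, text: str) -> float:
    """Return synthesis time divided by the duration of the generated audio."""
    start = time.perf_counter()
    samples, sample_rate = synthesize(text)
    elapsed = time.perf_counter() - start
    audio_seconds = len(samples) / sample_rate
    if audio_seconds == 0:
        raise ValueError("synthesizer returned no audio")
    return elapsed / audio_seconds


def fake_synthesize(text: str) -> Tuple[Sequence[float], int]:
    """Hypothetical stand-in: 0.05 s of silence per character at an assumed 24 kHz."""
    sample_rate = 24_000
    return [0.0] * int(0.05 * sample_rate * len(text)), sample_rate


rtf = real_time_factor(fake_synthesize, "hello")
print(f"real-time factor: {rtf:.4f}")
```

Swapping `fake_synthesize` for a wrapper around the real model gives a quick number to compare CPU and GPU runs on your own machine.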

## Graph

### Constructing Tools from APIs

We implement some arXiv API [tools](../projects/voice_agents/src/graph/tools.py) using helper functions that wrap the arXiv API, defined in [`arxiv_helpers`](../projects/voice_agents/src/graph/arxiv_helpers/arxiv_functions).

This is a nice example of how to construct tools from an API: we do all the work inside the helper functions, and that work is hidden from the agent. The actual tools look something like

```python
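from typing import Annotated

from langchain_core.tools import tool

# download_arxiv_pdf is the project's helper function


@tool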
def download_pdf(paper_id: Annotated[str, "The ID of the paper to download from the arXiv"]):
    """
    Download the PDF of a paper from arXiv, given its ID.
    """
    print(f"Attempting to download paper with id: {paper_id}...")

    return download_arxiv_pdf(paper_id)
```

but `download_arxiv_pdf()` is defined as:

```python
import os

import arxiv


def download_arxiv_pdf(paper_id: str, save_dir: str = "./downloads"):
    """
    Downloads the PDF of an arXiv paper given its ID.
    
    Args:
        paper_id (str): The arXiv ID (e.g., "2103.00020").
        save_dir (str): The directory to save the PDF in. Defaults to "./downloads".
    
    Returns:
        str: The file path of the downloaded PDF.
    """
    # Ensure directory exists
    os.makedirs(save_dir, exist_ok=True)
    
    client = arxiv.Client()
    
    # We must "search" by ID to get the paper object
    search = arxiv.Search(id_list=[paper_id])
    
    try:
        paper = next(client.results(search))
        
        # Create a safe filename using the ID and a sanitized title
        # e.g., "2103.00020_Attention_Is_All_You_Need.pdf"
        safe_title = "".join(c for c in paper.title if c.isalnum() or c in (' ', '_', '-')).rstrip()
        safe_title = safe_title.replace(" ", "_")
        filename = f"{paper_id}_{safe_title}.pdf"
        
        # Download
        path = paper.download_pdf(dirpath=save_dir, filename=filename)
        return f"Successfully downloaded file to: {path}"
        
    except StopIteration:
        return f"Error: Paper with ID {paper_id} not found."
    except Exception as e:
        return f"Error downloading paper: {str(e)}"
```

It doesn't make sense for the agent to know about all the work involved in downloading a PDF. That would just fill its context and, given the added complexity, make it more error-prone. **Always keep tools as simple as possible**.
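
One of the details the helper hides is how the filename is built. Here is that step lifted out of `download_arxiv_pdf()` so it can be tried without network access. The function name `safe_pdf_filename` and the placeholder ID and title are invented for this example; the character filter is the same one used in the helper above.

```python
def safe_pdf_filename(paper_id: str, title: str) -> str:
    """Build '<id>_<sanitized title>.pdf', keeping only letters, digits, ' ', '_' and '-'."""
    kept = "".join(c for c in title if c.isalnum() or c in (" ", "_", "-")).rstrip()
    return f"{paper_id}_{kept.replace(' ', '_')}.pdf"


# Placeholder ID and title, chosen only to show which characters are dropped.
print(safe_pdf_filename("0000.00000", "A Title: With/Unsafe? Characters"))
# prints: 0000.00000_A_Title_WithUnsafe_Characters.pdf
```

The agent never needs to reason about any of this: it supplies a paper ID and gets back a short status string.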

### Details

- You'll notice that we use human-in-the-loop, through the human-in-the-loop middleware, on `download_pdf`: we want confirmation before downloading anything.

- The function that handles streams is [`stream_graph_task`](../projects/voice_agents/src/graph/executor.py): you'll notice that it also handles interrupts. We need a function like this because we then pass it into the Flux model's stream.

- The nice CLI look comes from the `rich` library.
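
The human-in-the-loop check from the first bullet can be pictured as a gate in front of the tool. The sketch below is plain Python, not the middleware the project uses (which pauses the graph with an interrupt and resumes it after a decision). `with_approval`, `ask_user` and the stand-in `download_pdf` are hypothetical names for illustration only.

```python
from typing import Any, Callable, Dict


def with_approval(
    tool_fn: Callable[..., str],
    approve: Callable[[str, Dict[str, Any]], bool],
) -> Callable[..., str]:
    """Wrap a tool so it only runs when `approve(tool_name, args)` returns True."""
    def gated(**kwargs: Any) -> str:
        if not approve(tool_fn.__name__, kwargs):
            return f"Call to {tool_fn.__name__} rejected by the user."
        return tool_fn(**kwargs)
    return gated


def download_pdf(paper_id: str) -> str:
    """Hypothetical stand-in for the real tool; it does not touch the network."""
    return f"Downloaded {paper_id}"


def ask_user(tool_name: str, args: Dict[str, Any]) -> bool:
    """Ask on the terminal; anything other than 'y' counts as a rejection."""
    answer = input(f"Allow {tool_name} with {args}? [y/N] ")
    return answer.strip().lower() == "y"


# Interactive use would be: with_approval(download_pdf, ask_user)(paper_id="...")
auto_deny = with_approval(download_pdf, lambda name, args: False)
print(auto_deny(paper_id="2103.00020"))
```

Returning a rejection message, rather than raising, lets the agent read the outcome and tell the user that nothing was downloaded.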