# Building an Intelligent Text-to-Speech Agent with LangGraph and OpenAI

## Overview
This tutorial walks through building a text-to-speech (TTS) agent with LangGraph and OpenAI's APIs. The agent classifies input text, processes it according to its content type, and generates matching speech output.

## Motivation
Plain text-to-speech conversion reads every input the same way, whether it is a poem, a news item, or a joke. This project builds a TTS agent that first identifies what kind of content it has been given, then adapts both the wording and the voice to suit it.

## Key Components
1. **Content Classification**: Uses an OpenAI chat model (`gpt-4o-mini`) to categorize input text.
2. **Content Processing**: Applies specific processing based on the content type (general, poem, news, or joke).
3. **Text-to-Speech Conversion**: Uses OpenAI's TTS API (`tts-1`) to generate audio from the processed text.
4. **LangGraph Workflow**: Orchestrates the steps with a state graph.

## Method
The TTS agent operates through the following high-level steps:

1. **Text Input**: The system receives a text input from the user.
2. **Content Classification**: The input is classified into one of four categories: general, poem, news, or joke.
3. **Content-Specific Processing**: Based on the classification, the text undergoes specific processing:
   - General text remains unchanged
   - Poems are rewritten for enhanced poetic quality
   - News is reformatted into a formal news anchor style
   - Jokes are refined for humor
4. **Text-to-Speech Conversion**: The processed text is converted to speech with a voice chosen by content type (`alloy` for general, `nova` for poems, `onyx` for news, `shimmer` for jokes).
5. **Audio Output**: When saving is enabled, the audio is written to an MP3 file and played back in the notebook; otherwise it is generated but not kept, and no playback is available.

The entire workflow is managed by a LangGraph state graph. A shared state dictionary is passed from node to node, and a conditional edge after the classification step routes the text to the matching processor.
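
As a rough illustration of these steps, the sketch below threads a state dictionary through classify, process, and voice-selection stages in plain Python. It is a stand-in only: `toy_classify`, `tag_with`, and `run_pipeline` are hypothetical names, the keyword rules replace the GPT classifier, the processors only tag the text, and no audio is produced.

```python
from typing import Callable, Dict

State = Dict[str, str]

def toy_classify(state: State) -> State:
    """Keyword stand-in for the LLM classifier (an assumption, not the real logic)."""
    text = state["input_text"].lower()
    if text.startswith("breaking news"):
        label = "news"
    elif text.startswith("why "):
        label = "joke"
    elif "roses are red" in text:
        label = "poem"
    else:
        label = "general"
    return {**state, "content_type": label}

def passthrough(state: State) -> State:
    """General text is left unchanged."""
    return {**state, "processed_text": state["input_text"]}

def tag_with(prefix: str) -> Callable[[State], State]:
    """Build a stub processor that tags the text instead of calling a model."""
    def processor(state: State) -> State:
        return {**state, "processed_text": f"[{prefix}] {state['input_text']}"}
    return processor

# One processor per content type, mirroring the four categories
PROCESSORS: Dict[str, Callable[[State], State]] = {
    "general": passthrough,
    "poem": tag_with("poem"),
    "news": tag_with("news"),
    "joke": tag_with("joke"),
}

VOICES = {"general": "alloy", "poem": "nova", "news": "onyx", "joke": "shimmer"}

def run_pipeline(input_text: str) -> State:
    """Classify, route to the matching processor, then pick a voice."""
    state = toy_classify({"input_text": input_text})
    state = PROCESSORS[state["content_type"]](state)
    return {**state, "voice": VOICES[state["content_type"]]}

result = run_pipeline("Breaking news: a new deep-sea species was found.")
print(result["content_type"], result["voice"])  # news onyx
```

Each stage returns a new dictionary rather than mutating its input, which makes the individual steps easy to test in isolation.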

## Conclusion
This TTS agent shows how language models for content understanding can be combined with speech synthesis. Because the text is classified and rewritten before it is spoken, the output is better matched to the content than a one-voice-fits-all conversion, which is useful in applications ranging from content creation to accessibility.

The project also illustrates how a GPT model for text processing and OpenAI's TTS model can be chained into a multi-step language processing pipeline.

<div style="text-align: center;">

<img src="../images/tts_poem_generator_agent_langgraph.svg" alt="tts poem generator agent langgraph" style="width:80%; height:auto;">
</div>


## Import necessary libraries and set up environment

In [13]:
# Import required libraries
from typing import TypedDict
from langgraph.graph import StateGraph, END
from IPython.display import display, Audio
from openai import OpenAI
from dotenv import load_dotenv
import os
from datetime import datetime

# Load environment variables from a .env file, if present
load_dotenv()

# Fail early with a clear message if the API key is missing
# (assigning None to os.environ would raise a confusing TypeError)
if not os.getenv("OPENAI_API_KEY"):
    raise RuntimeError("OPENAI_API_KEY is not set. Add it to your environment or .env file.")

## Initialize OpenAI client and define state

In [2]:
# Initialize OpenAI client
client = OpenAI()

# Define state structure
class AgentState(TypedDict):
    input_text: str  # Original input text
    processed_text: str  # Text after content-specific processing
    audio_path: str  # Path to saved audio file (if applicable)
    content_type: str  # Classified content type
    save_audio: bool  # Flag to determine whether to save the audio file
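
The run function later in this tutorial fills in every field of the state by hand before invoking the graph. One way to keep that in a single place is a small factory. This is a minimal sketch: `make_initial_state` is a hypothetical helper, not part of the original notebook or of LangGraph, and the class is repeated so the block runs on its own.

```python
from typing import TypedDict

class AgentState(TypedDict):
    input_text: str      # Original input text
    processed_text: str  # Text after content-specific processing
    audio_path: str      # Path to saved audio file (if applicable)
    content_type: str    # Classified content type
    save_audio: bool     # Whether to save the audio file

def make_initial_state(input_text: str, save_audio: bool = False) -> AgentState:
    """Return a state with the inputs set and all output fields empty."""
    return AgentState(
        input_text=input_text,
        processed_text="",
        audio_path="",
        content_type="",
        save_audio=save_audio,
    )

state = make_initial_state("Hello, world!", save_audio=True)
print(state)
```

A `TypedDict` is an ordinary `dict` at runtime, so the result can be passed wherever the hand-written dictionary would be.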

## Define Node Functions

In [9]:
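def classify_content(state: AgentState) -> AgentState:
    """Classify the input text into one of four categories: general, poem, news, or joke."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Classify the content as one of: 'general', 'poem', 'news', 'joke'. Respond with only that single word."},
            {"role": "user", "content": state["input_text"]}
        ]
    )
    # Normalize the reply; stray quotes or punctuation would otherwise break routing
    label = response.choices[0].message.content.strip().strip("'\".").lower()
    # Fall back to "general" if the model returns anything outside the four labels
    state["content_type"] = label if label in {"general", "poem", "news", "joke"} else "general"
    return state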

def process_general(state: AgentState) -> AgentState:
    """Process general content (no specific processing, return as-is)."""
    state["processed_text"] = state["input_text"]
    return state

def process_poem(state: AgentState) -> AgentState:
    """Process the input text as a poem, rewriting it in a poetic style."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Rewrite the following text as a short, beautiful poem:"},
            {"role": "user", "content": state["input_text"]}
        ]
    )
    state["processed_text"] = response.choices[0].message.content.strip()
    return state

def process_news(state: AgentState) -> AgentState:
    """Process the input text as news, rewriting it in a formal news anchor style."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Rewrite the following text in a formal news anchor style:"},
            {"role": "user", "content": state["input_text"]}
        ]
    )
    state["processed_text"] = response.choices[0].message.content.strip()
    return state

def process_joke(state: AgentState) -> AgentState:
    """Process the input text as a joke, turning it into a short, funny joke."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Turn the following text into a short, funny joke:"},
            {"role": "user", "content": state["input_text"]}
        ]
    )
    state["processed_text"] = response.choices[0].message.content.strip()
    return state

def text_to_speech(state: AgentState) -> AgentState:
    """Convert the processed text to speech using OpenAI's text-to-speech API."""
    # Pick a voice per content type; unknown types fall back to "alloy"
    voice_map = {
        "general": "alloy",
        "poem": "nova",
        "news": "onyx",
        "joke": "shimmer"
    }
    voice = voice_map.get(state["content_type"], "alloy")

    if state["save_audio"]:
        # Create a directory for audio files if it doesn't exist
        audio_dir = "audio_files"  # You can change this to any directory you want
        os.makedirs(audio_dir, exist_ok=True)

        # Generate a filename based on timestamp and content type
        timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
        filename = f"{timestamp}_{state['content_type']}.mp3"
        file_path = os.path.join(audio_dir, filename)

        # Stream the audio to disk; calling stream_to_file on a non-streaming
        # response triggers a deprecation warning in the OpenAI SDK
        with client.audio.speech.with_streaming_response.create(
            model="tts-1",
            voice=voice,
            input=state["processed_text"]
        ) as audio_response:
            audio_response.stream_to_file(file_path)

        state["audio_path"] = file_path
    else:
        # Generate audio but discard it when save_audio is False
        client.audio.speech.create(
            model="tts-1",
            voice=voice,
            input=state["processed_text"]
        )
        state["audio_path"] = ""  # No file saved

    return state
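
The timestamp in the filename has one-second resolution, so two clips of the same content type generated within the same second would share a path, and the second would overwrite the first. Below is a minimal sketch of one workaround, assuming a short random suffix in the filename is acceptable. `build_audio_path` is a hypothetical helper, not part of the original code.

```python
import os
import uuid
from datetime import datetime
from typing import Optional

def build_audio_path(content_type: str, audio_dir: str = "audio_files",
                     now: Optional[datetime] = None) -> str:
    """Return a path of the form <audio_dir>/<timestamp>_<content_type>_<suffix>.mp3."""
    timestamp = (now or datetime.now()).strftime("%Y%m%d_%H%M%S")
    suffix = uuid.uuid4().hex[:8]  # random part keeps same-second files apart
    return os.path.join(audio_dir, f"{timestamp}_{content_type}_{suffix}.mp3")

# Two paths built for the same second still differ
fixed = datetime(2024, 9, 8, 17, 45, 10)
first = build_audio_path("poem", now=fixed)
second = build_audio_path("poem", now=fixed)
print(first)
print(second)
```

The optional `now` argument exists only to make the function deterministic enough to test; normal callers can omit it.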

## Define and Compile the Graph

In [10]:
# Define the graph
workflow = StateGraph(AgentState)

# Add nodes to the graph
workflow.add_node("classify_content", classify_content)
workflow.add_node("process_general", process_general)
workflow.add_node("process_poem", process_poem)
workflow.add_node("process_news", process_news)
workflow.add_node("process_joke", process_joke)
workflow.add_node("text_to_speech", text_to_speech)

# Set the entry point of the graph
workflow.set_entry_point("classify_content")

# Define conditional edges based on content type
workflow.add_conditional_edges(
    "classify_content",
    lambda x: x["content_type"],
    {
        "general": "process_general",
        "poem": "process_poem",
        "news": "process_news",
        "joke": "process_joke",
    }
)

# Connect processors to text-to-speech
workflow.add_edge("process_general", "text_to_speech")
workflow.add_edge("process_poem", "text_to_speech")
workflow.add_edge("process_news", "text_to_speech")
workflow.add_edge("process_joke", "text_to_speech")

# Mark text-to-speech as the final step of the graph
workflow.add_edge("text_to_speech", END)

# Compile the graph
app = workflow.compile()
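
The routing lambda can also be written as a named function, which is easier to unit-test and gives one place to handle an unexpected label. This is a minimal sketch, assuming unknown labels should be treated as general text. `route_by_content_type` is a hypothetical name; the intended use is to pass it to `add_conditional_edges` in place of the lambda.

```python
from typing import Any, Mapping

VALID_TYPES = ("general", "poem", "news", "joke")

def route_by_content_type(state: Mapping[str, Any]) -> str:
    """Return the routing key for the state, defaulting to 'general'."""
    label = str(state.get("content_type", "")).strip().lower()
    return label if label in VALID_TYPES else "general"

print(route_by_content_type({"content_type": "Poem "}))     # poem
print(route_by_content_type({"content_type": "limerick"}))  # general
```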

## Define Function to Run Agent and Play Audio

In [11]:
def run_tts_agent_and_play(input_text: str, save_audio: bool = False):
    """Run the TTS agent and play the generated audio if saved.
    
    Args:
        input_text (str): The text to process and convert to speech.
        save_audio (bool): Whether to save the generated audio file.
    
    Returns:
        dict: The result state containing processed text and audio path (if saved).
    """
    result = app.invoke({
        "input_text": input_text, 
        "processed_text": "", 
        "audio_path": "", 
        "content_type": "", 
        "save_audio": save_audio
    })
    
    if save_audio:
        audio_path = result["audio_path"]
        # Display the audio player
        display(Audio(audio_path, autoplay=True))
    else:
        print("Audio generated but not saved. No playback available.")
    
    return result

## Test the Text-to-Speech Agent

In [14]:
# Define example inputs for each content type
examples = {
    "general": "The quick brown fox jumps over the lazy dog.",
    "poem": "Roses are red, violets are blue, AI is amazing, and so are you!",
    "news": "Breaking news: Scientists discover a new species of deep-sea creature in the Mariana Trench.",
    "joke": "Why don't scientists trust atoms? Because they make up everything!"
}
# Test the TTS agent with each example
for content_type, text in examples.items():
    print(f"\nProcessing example for {content_type} content:")
    print(f"Input text: {text}")
    
    # Run the TTS agent
    result = run_tts_agent_and_play(text, save_audio=True)  # Set save_audio to True for testing
    
    # Print results
    print(f"Detected content type: {result['content_type']}")
    print(f"Processed text: {result['processed_text']}")
    
    if result['audio_path']:
        print(f"Audio saved to: {result['audio_path']}")
    else:
        print("Audio not saved (save_audio was set to False)")
    
    print("-" * 50)

print("All examples processed. You can replay any saved audio by rerunning the respective Audio(result['audio_path']) cell.")


Processing example for general content:
Input text: The quick brown fox jumps over the lazy dog.
Detected content type: poem
Processed text: In autumn's glow, the swift fox leaps,  
O'er slumbering hound, where stillness keeps.  
With grace in motion, a dance so free,  
A tale of nature's harmony.
Audio saved to: audio_files\20240908_174510_poem.mp3
--------------------------------------------------

Processing example for poem content:
Input text: Roses are red, violets are blue, AI is amazing, and so are you!
Detected content type: poem
Processed text: In realms of knowledge vast and wide,  
You bloom like roses, full of pride.  
With data gathered, sharp and bright,  
October’s wisdom, your guiding light.  

Violets dance in shades anew,  
AI shines forth, and so do you!
Audio saved to: audio_files\20240908_174514_poem.mp3
--------------------------------------------------

Processing example for news content:
Input text: Breaking news: Scientists discover a new species of deep-sea creature in the Mariana Trench.
Detected content type: news
Processed text: Good evening. In a remarkable development, scientists have identified a previously unknown species of deep-sea creature residing in the depths of the Mariana Trench. This groundbreaking discovery sheds new light on the diverse ecosystems found in the ocean's most profound regions. More details on this significant finding will follow shortly. Stay tuned for updates.
Audio saved to: audio_files\20240908_174518_news.mp3
--------------------------------------------------

Processing example for joke content:
Input text: Why don't scientists trust atoms? Because they make up everything!
Detected content type: joke
Processed text: Why don’t AI models tell jokes after October 2023? Because they can’t keep up with the punchlines!
Audio saved to: audio_files\20240908_174523_joke.mp3
--------------------------------------------------
All examples processed. You can replay any saved audio by rerunning the respective Audio(result['audio_path']) cell.
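
In the run above, the general example was labelled `poem`, so it is worth comparing the detected type with the expected one. The sketch below tallies matches for (expected, detected) pairs. `summarize_labels` is a hypothetical helper, and the pairs are copied from the output shown above.

```python
from typing import List, Tuple

Pair = Tuple[str, str]

def summarize_labels(pairs: List[Pair]) -> Tuple[int, List[Pair]]:
    """Return the number of correct labels and the list of mismatched pairs."""
    mismatches = [(expected, detected) for expected, detected in pairs
                  if expected != detected]
    return len(pairs) - len(mismatches), mismatches

# (expected, detected) pairs taken from the output shown above
observed = [("general", "poem"), ("poem", "poem"), ("news", "news"), ("joke", "joke")]
correct, mismatches = summarize_labels(observed)
print(f"{correct}/{len(observed)} correct; mismatches: {mismatches}")
# 3/4 correct; mismatches: [('general', 'poem')]
```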
