This project is divided into three parts:
1. Speech-to-text, to capture the user's spoken input
2. The LLM, to generate a response
3. Text-to-speech, to read the response aloud

# 1. Speech recognition

In this part, we need to capture audio from the microphone and convert it to text. The function below does both.

In [2]:
import speech_recognition as sr

r = sr.Recognizer()

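def recognize_speech_from_mic(recognizer):
    """Listen on the default microphone and return the recognized text."""
    with sr.Microphone() as source:
        print("Listening...")
        # Seconds of silence that mark the end of a phrase
        recognizer.pause_threshold = 2
        audio_text = recognizer.listen(source)
    print("Recognizing...")

    try:
        return recognizer.recognize_google(audio_text)
    except (sr.UnknownValueError, sr.RequestError):
        # Unintelligible speech or a failed API request: listen again
        print("Could not recognize ...")
        return recognize_speech_from_mic(recognizer)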

# print(recognize_speech_from_mic(r))
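
The function above retries by calling itself, so a long run of failed recognitions keeps nesting calls. A minimal sketch of a bounded alternative follows. `listen_with_retries` and `listen_once` are hypothetical names used for illustration, not part of `speech_recognition`, and a stand-in callable replaces the microphone so the cell runs without audio hardware.

```python
from typing import Callable, Optional

def listen_with_retries(listen_once: Callable[[], Optional[str]],
                        max_attempts: int = 3) -> Optional[str]:
    """Call listen_once until it returns text or the attempts run out."""
    for attempt in range(1, max_attempts + 1):
        text = listen_once()
        if text:
            return text
        print(f"Could not recognize (attempt {attempt}/{max_attempts})")
    return None

# Stand-in for a real microphone call: fails twice, then succeeds.
_fake_results = iter([None, None, "hello there"])
print(listen_with_retries(lambda: next(_fake_results)))
```

With a real microphone, `listen_once` would wrap a single listen-and-recognize attempt that returns `None` on failure.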

# 2. LLM

In this part, we need a function that holds a conversation with an LLM and keeps the context between calls.

In [3]:
from groq import Groq
import os
import json
from dotenv import load_dotenv

load_dotenv()

def groq_chatbot_conversation(new_message, model="llama-3.1-8b-instant", api_key=None, history_file="conversation_history.txt"):
    """
    A function to interact with a Groq-based chatbot that maintains conversation context using a text file.

    Parameters:
    - new_message (str): The user's latest input message.
    - model (str): The model to use (default: "llama-3.1-8b-instant").
    - api_key (str): Groq API key (default: None, will use GROQ_API_KEY from environment variables if not provided).
    - history_file (str): The file to store conversation history (default: "conversation_history.txt").

    Returns:
    - str: The chatbot's response.
    """
    api_key = api_key or os.getenv("GROQ_API_KEY")
    if not api_key:
        raise ValueError("API key is required. Set it via the 'api_key' parameter or the 'GROQ_API_KEY' environment variable.")
    
    # Load conversation history
    if os.path.exists(history_file):
        with open(history_file, "r", encoding="utf-8") as file:
            messages = json.load(file)
    else:
        messages = []
    
    # Append new user message
    messages.append({"role": "user", "content": new_message})
    
    # Pass the resolved key explicitly so the api_key parameter is honored
    client = Groq(api_key=api_key)
    completion = client.chat.completions.create(
        model=model,
        messages=messages,
        temperature=1,
        max_completion_tokens=1024,
        top_p=1,
        stream=True,
        stop=None,
    )
    
    response_text = ""
    for chunk in completion:
        response_text += chunk.choices[0].delta.content or ""
    
    # Append assistant response
    messages.append({"role": "assistant", "content": response_text})
    
    # Save updated conversation history
    with open(history_file, "w", encoding="utf-8") as file:
        json.dump(messages, file, indent=4)
    
    return response_text

# Example usage:
# print(groq_chatbot_conversation("It's me!"))
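
The conversation context lives entirely in the history file: each call loads the saved messages, appends the new turn, and writes the list back. The sketch below isolates that load/append/save cycle without calling any API. `load_history`, `save_history`, and the `max_messages` cap are assumptions added for illustration; the original function keeps every message.

```python
import json
import os
import tempfile

def load_history(path):
    """Return the saved message list, or an empty list if no file exists yet."""
    if os.path.exists(path):
        with open(path, "r", encoding="utf-8") as f:
            return json.load(f)
    return []

def save_history(path, messages, max_messages=20):
    """Save only the most recent messages (the cap is an assumption)."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(messages[-max_messages:], f, indent=4)

# Simulate two turns with placeholder replies instead of a model call
path = os.path.join(tempfile.mkdtemp(), "history.json")
for user_text, bot_text in [("Hi", "Hello!"), ("Who am I?", "You said hi earlier.")]:
    messages = load_history(path)
    messages.append({"role": "user", "content": user_text})
    messages.append({"role": "assistant", "content": bot_text})
    save_history(path, messages)

print(len(load_history(path)))
```

Capping the saved list is one simple way to keep the prompt from growing without bound over a long session, at the cost of forgetting older turns.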


In [17]:
print(groq_chatbot_conversation("the song name is correct but you kinda wiffed on the lyrics ngl"))

I was close, but I clearly misremembered the lyrics. Thank you for correcting me and being kind about it. "Hello" is a classic Adele song, but I guess I need to brush up on my lyrics. If you'd like to share the correct lyrics, I'd be happy to learn from my mistake!


# 3. Text-to-Speech

In this part, we need a function that converts text to speech and plays it.

In [13]:
import asyncio
import edge_tts
import io
import nest_asyncio
from pydub import AudioSegment
from pydub.playback import play

# Allow asyncio to work in Jupyter Notebook
nest_asyncio.apply()

async def speak(text: str, voice="en-US-AriaNeural", rate="+100%"):
    """Convert text to speech and play it directly in Jupyter Notebook."""
    tts = edge_tts.Communicate(text, voice, rate=rate)
    stream = io.BytesIO()

    async for chunk in tts.stream():
        if chunk["type"] == "audio":
            stream.write(chunk["data"])

    stream.seek(0)
    audio = AudioSegment.from_file(stream, format="mp3")
    play(audio)

# Thin wrapper around speak(); call it with `await` from a notebook cell
async def tts_play(text, voice="en-US-AriaNeural", rate="+100%"):
    await speak(text, voice, rate)

# Example usage
# await tts_play("""It was a dark and stormy night. All of a sudden, a voice came out of the darkness and said, "Hello! I'm here to help you with your query. How can I assist you today?""")
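
The `rate` values used in this notebook (`"+100%"`, `"+50%"`) are signed percentage strings. Assuming the same format also applies to slower speech (for example `"-10%"`), a small helper can build the string from an integer. `format_rate` is a hypothetical name, not part of `edge_tts`.

```python
def format_rate(percent: int) -> str:
    """Build a signed percentage string such as '+50%' or '-10%'."""
    # The '+' format flag forces a sign on non-negative numbers too
    return f"{percent:+d}%"

print(format_rate(50))
print(format_rate(-10))
```

The result could then be passed as the `rate` argument, e.g. `rate=format_rate(50)`.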


# 4. Main

This part combines the functions from the previous parts into a single loop: listen, generate a response, and speak it.

In [14]:
while True:
    user_text = recognize_speech_from_mic(r)
    print("user: ", user_text)
    response = groq_chatbot_conversation(user_text, model="llama-3.1-8b-instant", history_file="First_try.txt")
    print("bot: ", response)
    await tts_play(response, voice="en-US-AnaNeural", rate="+50%")

Listening...
Recognizing...
user:  well it's more like another name that's going to take as input my voice and then three kids and the response to it through the other room and then get through a text to speech module to answer me
bot:  So it sounds like you're trying to create a voice-controlled system, where you speak to an assistant, and it responds back to you through a text-to-speech module.

You're using a framework that can process your voice input, understand your requests, and then send responses back to you through a text-to-speech module. This is often referred to as a voice assistant or a voice interaction system.

In this case, the hyperparameter tuning you're doing is probably related to the speech recognition module, which needs to accurately transcribe your voice input into text. Tweaking the speaking rates, pitch, and cadence might be related to the text-to-speech module, which needs to convert the text response into human-like speech that sounds natural and clear.

Is

OSError: [Errno -9988] Stream closed
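
The loop above has no exit condition, so it runs until it is interrupted or an error occurs. A minimal sketch of the same listen, chat, speak cycle with a spoken stop phrase follows. `run_voice_bot` and `stop_phrases` are assumptions added for illustration. The three steps are passed in as plain synchronous callables so the sketch runs with stand-ins; in the notebook, the speak step is a coroutine and would need to be awaited.

```python
def run_voice_bot(listen, chat, speak, stop_phrases=("stop", "goodbye")):
    """Run listen -> chat -> speak until the user says a stop phrase."""
    turns = 0
    while True:
        user_text = listen()
        print("user: ", user_text)
        if user_text.strip().lower() in stop_phrases:
            break
        response = chat(user_text)
        print("bot: ", response)
        speak(response)
        turns += 1
    return turns

# Stand-ins for the real microphone, LLM, and TTS functions
inputs = iter(["hello", "Goodbye"])
spoken = []
turns = run_voice_bot(lambda: next(inputs),
                      lambda text: f"You said: {text}",
                      spoken.append)
print("completed turns:", turns)
```

Passing the steps in as arguments also makes the loop testable without a microphone, an API key, or audio output.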