Thanks for this work. This is the first purely speech-modal conversational model I have read about, but I have a question.
Current conversational speech synthesis systems are bridged through text, so that they can use an NLP/LLM as the brain; much of the safety and controllability work is done in the LLM.
But in pure audio speech-to-speech, there is no such convenient intermediary: the 8 Hz speech codes are unreadable and not semantically centered. So how do you control the response content of the hertz model?