# WhiTTsper the Lora
(Whisper + TTS + Alpaca-Lora) - *I'm not good at naming things*

*@ImPavloh*

**Remember to switch the runtime environment to GPU before running the code.**
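
A quick way to confirm that the runtime really has a GPU is sketched below. It assumes PyTorch may or may not be installed yet, so the import is guarded; `gpu_available` is a hypothetical helper name, not part of this notebook.

```python
import importlib.util


def gpu_available() -> bool:
    """Return True only if PyTorch is installed and sees a CUDA device."""
    # Avoid an ImportError when torch has not been installed yet.
    if importlib.util.find_spec("torch") is None:
        return False
    import torch

    return torch.cuda.is_available()


if gpu_available():
    print("GPU runtime detected.")
else:
    print("No GPU detected: switch the runtime type to GPU before continuing.")
```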

**Installing Libraries and Dependencies**

*Estimated installation time: around 2 minutes.*

In [None]:
!pip install -q git+https://github.com/zphang/transformers@c3dc391
!pip install -q git+https://github.com/huggingface/peft.git
!pip install -q git+https://github.com/openai/whisper.git
!pip install -q datasets loralib sentencepiece
!pip install -q bitsandbytes
!pip install -q langdetect
!pip install -q gradio
!pip install -q torch
!pip install -q gtts
import torch
import whisper
import tempfile
import gradio as gr
from gtts import gTTS
from peft import PeftModel
from langdetect import detect
from transformers import LLaMATokenizer, LLaMAForCausalLM, GenerationConfig

**Load and configure the Alpaca-LoRA (LLaMA) model with 7B parameters.**

*Estimated execution time: around 3 minutes.*

In [None]:
tokenizer = LLaMATokenizer.from_pretrained("decapoda-research/llama-7b-hf")
model = LLaMAForCausalLM.from_pretrained(
    "decapoda-research/llama-7b-hf",
    load_in_8bit=True,
    torch_dtype=torch.float16,
    device_map="auto",
)
model = PeftModel.from_pretrained(model, "tloen/alpaca-lora-7b")
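
As a rough illustration of why `load_in_8bit=True` helps here, the sketch below estimates the memory taken by the weights alone. It assumes a nominal 7 billion parameters and ignores activations, the LoRA adapter, and any layers kept in higher precision, so treat the numbers as a lower bound. `weight_memory_gb` is a hypothetical helper.

```python
def weight_memory_gb(num_params: float, bytes_per_param: float) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


NUM_PARAMS = 7e9  # assumption: nominal, rounded size of the 7B model

fp16_gb = weight_memory_gb(NUM_PARAMS, 2)  # float16: 2 bytes per weight
int8_gb = weight_memory_gb(NUM_PARAMS, 1)  # 8-bit: 1 byte per weight

print(f"float16 weights: ~{fp16_gb:.0f} GB")
print(f"8-bit weights:   ~{int8_gb:.0f} GB")
```

Halving the bytes per weight halves the weight footprint, which leaves more room on a single GPU for activations and beam search.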

Define the helper functions that build an instruction prompt and generate the model's answer to it.

In [None]:
def gen_instruction(instruction, input=None):
    input_section = f"### Input:\n{input}\n" if input else ""
    return f"""Here is an instruction that describes a task. Write a response that adequately completes the request.

### Instruction:
{instruction}

{input_section}### Response:"""

generation_config = GenerationConfig(
    temperature=0.1,
    top_p=0.75,
    num_beams=4,
    num_return_sequences=1,
    max_length=512,
)

def evaluate(instruction, input=None, temperature=0.1):
    prompt = gen_instruction(instruction, input)
    inputs = tokenizer(prompt, return_tensors="pt")
    input_ids = inputs["input_ids"].to(model.device)
    generation_output = model.generate(
        input_ids=input_ids,
        generation_config=GenerationConfig(
            temperature=temperature,
            top_p=0.75,
            num_beams=4,
            num_return_sequences=1,
            max_length=512,
        ),
        return_dict_in_generate=True,
        output_scores=True,
        max_new_tokens=256
    )
    sequence = generation_output.sequences[0]
    output = tokenizer.decode(sequence)
    response_text = output.split("### Response:")[-1].strip()
    return response_text
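
To make the prompt format concrete, here is a self-contained sketch that mirrors the template above and shows why `evaluate` splits on `### Response:`. `build_prompt` and `extract_response` are hypothetical names, and the decoded text is a hand-written stand-in for real model output.

```python
RESPONSE_MARKER = "### Response:"


def build_prompt(instruction, input=None):
    """Standalone mirror of the notebook's template, for illustration only."""
    input_section = f"### Input:\n{input}\n" if input else ""
    return (
        "Here is an instruction that describes a task. "
        "Write a response that adequately completes the request.\n\n"
        f"### Instruction:\n{instruction}\n\n"
        f"{input_section}{RESPONSE_MARKER}"
    )


def extract_response(decoded):
    """Keep only the text after the last response marker."""
    return decoded.split(RESPONSE_MARKER)[-1].strip()


prompt = build_prompt("Translate to French.", "Good morning")
# A causal LM returns the prompt followed by its continuation, so the
# decoded sequence is simulated here by appending a made-up answer.
decoded = prompt + "\nBonjour"

print(prompt)
print(extract_response(decoded))
```

Splitting on the last marker matters because the decoded sequence contains the whole prompt, not just the newly generated tokens.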

# **Gradio Interface**

Click the record button to start recording from the microphone. You can also choose which Whisper model size to use for transcription.

Larger Whisper models transcribe more accurately, but they take longer to load and run.

1.   **Tiny** 39M parameters
2.   **Base** 74M parameters **[recommended]**
3.   **Small** 244M parameters
4.   **Medium** 769M parameters
5.   **Large** 1550M parameters
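
Because larger checkpoints take longer to load, it can help to load each size only once and reuse it across requests. The sketch below shows one possible way to do that with `functools.lru_cache`. `make_cached_loader` is a hypothetical helper, and a stand-in loader is used so the example runs without downloading a model.

```python
from functools import lru_cache


def make_cached_loader(load_fn, maxsize=2):
    """Wrap a loader so repeated requests for the same size reuse the model."""
    return lru_cache(maxsize=maxsize)(load_fn)


load_calls = []


def fake_load_model(model_size):
    # Stand-in for a real loader; records each call that actually "loads".
    load_calls.append(model_size)
    return f"<model:{model_size}>"


get_model = make_cached_loader(fake_load_model)

get_model("base")
get_model("base")   # served from the cache, no second load
get_model("small")

print(load_calls)
```

In this notebook the wrapper would presumably be built as `make_cached_loader(whisper.load_model)`, with `maxsize` kept small because every cached model stays in memory.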


Once the recording is submitted, Whisper transcribes the audio automatically and the LLaMA model replies with a spoken response. Each exchange is added to the conversation history alongside the earlier messages.
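
The conversation history can be pictured as a plain list of prefixed lines that is re-joined into a single prompt on every turn. This is a minimal sketch with a canned reply function standing in for the model; `add_turn` and `fake_reply` are hypothetical names.

```python
def add_turn(history, user_text, reply_fn):
    """Append the user's message, get a reply to the full history, store it."""
    history.append("User: " + user_text)
    reply = reply_fn("\n".join(history)).strip()
    history.append("AI: " + reply)
    return "\n".join(history)


def fake_reply(conversation):
    # Stand-in for the language model: reports how many lines it was given.
    return f"I have seen {len(conversation.splitlines())} line(s)."


history = []
add_turn(history, "Hello", fake_reply)
transcript = add_turn(history, "How are you?", fake_reply)
print(transcript)
```

Since the whole history is sent on each turn, the prompt grows with the conversation, so trimming older turns may be worth considering for long sessions.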

In [None]:
# WIP

def stt(tmp_filename, model_size, reset_conversation, temperature):
    if reset_conversation:
        gr.State.conversation_history = []
        return "", None, None

    if not tmp_filename or not model_size:
        return "Please record or attach an audio clip and select a Whisper model.", None, None

    if not hasattr(gr.State, "conversation_history"):
        gr.State.conversation_history = []

    conversation_history = gr.State.conversation_history

    try:
        whisper_model = whisper.load_model(model_size)
        result = whisper_model.transcribe(tmp_filename)

        text_input = result['text']
        conversation_history.append("User: " + text_input)
        conversation_input = "\n".join(conversation_history)
        
        response_text = evaluate(conversation_input, temperature=temperature)
        
        conversation_history.append("AI: " + response_text.strip())

        formatted_conversation_history = "\n".join(conversation_history)

        language = detect(response_text)
        tts = gTTS(response_text, lang=language)
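        # Reserve a temp file path and close the handle before gTTS writes to it
        with tempfile.NamedTemporaryFile(suffix='.mp3', delete=False) as tmp_file:
            response_audio_path = tmp_file.name
        tts.save(response_audio_path)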

        return formatted_conversation_history, response_audio_path, response_text.strip()

    except Exception as e:
        error_message = str(e)
        return f"Error: {error_message}", None, None

inputs = [
    gr.Audio(source='microphone', type='filepath', label="Talk here 🗣️"),
    gr.Dropdown(choices=['tiny', 'base', 'small', 'medium', 'large'], value='base', label="Model size 📦"),
    gr.Checkbox(label="Restart conversation", value=False),
    gr.Slider(minimum=0.1, maximum=1.0, step=0.1, value=0.1, label="Generation temperature 🌡️"),
]

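# stt returns three values, so the interface needs three output components
outputs = [
    gr.Textbox(label="Conversation history 📄"),
    gr.Audio(type='filepath', label="Response 🔊"),
    gr.Textbox(label="Last response 💬"),
]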

gr.Interface(
    fn=stt,
    inputs=inputs,
    outputs=outputs,
    allow_flagging='never',
    title="🗣️ WhiTTsper the Lora 🦜",
    description="This demo uses speech recognition and speech synthesis technologies, OpenAI's Whisper and Google Text-to-Speech respectively, to interact with the Alpaca-LoRA AI ~ LLaMa 7B model.",
    css="footer {visibility: hidden}"
    ).launch()
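
One fragile spot in the flow above is the language hand-off between `langdetect` and gTTS: detection can fail on empty or very short text, and a detected code is not guaranteed to be one gTTS accepts. The sketch below shows a possible fallback. `pick_tts_language` is a hypothetical helper, and the detector and the set of supported codes are passed in so the example runs without either library.

```python
def pick_tts_language(text, detect_fn, supported, default="en"):
    """Return a language code that is safe to pass to the TTS engine."""
    try:
        code = detect_fn(text)
    except Exception:
        # Detection failed (for example on empty text): use the default.
        return default
    return code if code in supported else default


def fake_detect(text):
    # Stand-in detector for illustration only.
    if not text.strip():
        raise ValueError("no features in text")
    return "es" if "hola" in text.lower() else "xx"


SUPPORTED = {"en", "es", "fr"}  # illustrative subset, not gTTS's real list

print(pick_tts_language("Hola, ¿qué tal?", fake_detect, SUPPORTED))
print(pick_tts_language("???", fake_detect, SUPPORTED))
print(pick_tts_language("", fake_detect, SUPPORTED))
```

With the real libraries, `detect_fn` would be `langdetect.detect`, and `supported` could plausibly be built from `gtts.lang.tts_langs()`, assuming that function is available in the installed gTTS version.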