<a href="https://colab.research.google.com/github/Troyanovsky/Building-with-GenAI/blob/main/tutorial_voice_notes.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Build with GenAI: Turn Rambling into Coherent Writing

- Completely local (no OpenAI API calls). Run the code on your own computer to keep everything private, or use Google Colab's free T4 GPU (just select Runtime > Run all).
- The code is easy to adapt to other tasks such as journaling or brainstorming.
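
As a rough sketch of what such an adaptation might look like, the snippet below swaps the instruction text per task while keeping the system/user message structure that the notebook sends to the LLM. The task names, the `TASK_INSTRUCTIONS` table, and the `build_messages` helper are illustrative assumptions, not part of the original notebook.

```python
# Hypothetical per-task instructions; the wording is illustrative only.
TASK_INSTRUCTIONS = {
    "writing": "Turn the following text into coherent prose in markdown format.",
    "journaling": "Turn the following text into a first-person journal entry in markdown format.",
    "brainstorming": "Organize the following text into a markdown list of distinct ideas.",
}


def build_messages(task, text):
    """Return chat messages for the given task (assumed helper, not in the notebook)."""
    if task not in TASK_INSTRUCTIONS:
        raise ValueError(f"Unknown task: {task!r}")
    instruction = TASK_INSTRUCTIONS[task]
    return [
        {"role": "system", "content": "You are an assistant who rewrites the user's spoken notes."},
        {
            "role": "user",
            "content": (
                f"{instruction} Keep the original meaning and do not add your own ideas.\n"
                f"Text:\n{text}\n"
                "Reply with just the markdown content."
            ),
        },
    ]
```

The returned list could then be passed as the `messages` argument of the chat completion call, so switching tasks only means changing the `task` string.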

This is the accompanying code for this tutorial: https://medium.com/design-bootcamp/build-with-genai-turn-rambling-into-writing-with-whisper-and-local-llm-394e8dd5b83f

This is part of the "Build with GenAI" series. Other tutorial projects can be found at: https://github.com/Troyanovsky/Building-with-GenAI/tree/main

In [None]:
!nvcc --version

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Aug_15_22:02:13_PDT_2023
Cuda compilation tools, release 12.2, V12.2.140
Build cuda_12.2.r12.2/compiler.33191640_0


In [None]:
!pip install --upgrade pip
!pip install --upgrade transformers datasets[audio] accelerate
!pip install https://github.com/abetlen/llama-cpp-python/releases/download/v0.2.90-cu122/llama_cpp_python-0.2.90-cp310-cp310-linux_x86_64.whl
!pip install gradio
!apt-get -y install -qq aria2
!aria2c --console-log-level=error -c -x 16 -s 16 -k 1M "https://huggingface.co/NousResearch/Hermes-3-Llama-3.1-8B-GGUF/resolve/main/Hermes-3-Llama-3.1-8B.Q5_K_M.gguf?download=true" -d /content/gguf_models/ -o Hermes-3-Llama-3.1-8B.Q5_K_M.gguf

Collecting pip
  Downloading pip-24.2-py3-none-any.whl.metadata (3.6 kB)
Downloading pip-24.2-py3-none-any.whl (1.8 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.8/1.8 MB 22.1 MB/s eta 0:00:00
Installing collected packages: pip
  Attempting uninstall: pip
    Found existing installation: pip 24.1.2
    Uninstalling pip-24.1.2:
      Successfully uninstalled pip-24.1.2
Successfully installed pip-24.2
Collecting transformers
  Downloading transformers-4.45.2-py3-none-any.whl.metadata (44 kB)
Collecting accelerate
  Downloading accelerate-1.0.1-py3-none-any.whl.metadata (19 kB)
Collecting datasets[audio]
  Downloading datasets-3.0.1-py3-none-any.whl.metadata (20 kB)
Collecting tokenizers<0.21,>=0.20 (from transformers)
  Downloading tokenizers-0.20.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.7 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets[audio])
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Col

In [None]:
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
from llama_cpp import Llama
import gradio as gr

# Module-level handles: each model is loaded lazily and unloaded after use,
# so only one of them needs to occupy GPU memory at a time.
whisper_pipe = None
llm = None

In [None]:
# Whisper V3 Turbo Setup
def load_whisper():
    global whisper_pipe
    if whisper_pipe is None:
        device = "cuda:0" if torch.cuda.is_available() else "cpu"
        torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

        model_id = "openai/whisper-large-v3-turbo"

        model = AutoModelForSpeechSeq2Seq.from_pretrained(
            model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
        )
        model.to(device)

        processor = AutoProcessor.from_pretrained(model_id)

        whisper_pipe = pipeline(
            "automatic-speech-recognition",
            model=model,
            tokenizer=processor.tokenizer,
            feature_extractor=processor.feature_extractor,
            torch_dtype=torch_dtype,
            device=device,
            return_timestamps=True,
            chunk_length_s=30,
            stride_length_s=5
        )

def unload_whisper():
    global whisper_pipe
    if whisper_pipe is not None:
        del whisper_pipe
        whisper_pipe = None
        torch.cuda.empty_cache()

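def transcribe_audio(audio):
    """Transcribe the audio file at the given path and return the full text."""
    load_whisper()
    try:
        result = whisper_pipe(audio)
    finally:
        # Free GPU memory even if transcription fails, so the LLM can still load.
        unload_whisper()
    # Chunk texts may carry their own leading space; strip before joining.
    full_transcription = " ".join(chunk["text"].strip() for chunk in result["chunks"])
    return full_transcription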
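
To make the `chunk_length_s=30` and `stride_length_s=5` settings above more concrete, the sketch below computes approximate window boundaries for a long recording. It assumes the stride applies on both sides of each chunk, so consecutive windows advance by `chunk_length_s - 2 * stride_length_s` seconds. `chunk_windows` is a hypothetical helper for illustration and simplifies what the pipeline does internally.

```python
def chunk_windows(duration_s, chunk_length_s=30.0, stride_length_s=5.0):
    """Return approximate (start, end) windows in seconds for chunked transcription.

    Simplified illustration: assumes a symmetric stride on both sides of a chunk.
    """
    step = chunk_length_s - 2 * stride_length_s
    if step <= 0:
        raise ValueError("chunk_length_s must be larger than twice stride_length_s")

    windows = []
    start = 0.0
    while start < duration_s:
        end = min(start + chunk_length_s, duration_s)
        windows.append((start, end))
        if end >= duration_s:
            break  # the last window already reaches the end of the audio
        start += step
    return windows


# A 70-second recording is covered by three overlapping 30-second windows.
print(chunk_windows(70))
```

Under these assumptions, neighboring windows share 10 seconds of audio, which gives the model context on both sides of each boundary.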

In [None]:
# Llama Setup
def load_llama():
    global llm
    if llm is None:
        llm = Llama(
            model_path="/content/gguf_models/Hermes-3-Llama-3.1-8B.Q5_K_M.gguf",
            n_ctx=8192,
            n_gpu_layers=-1
        )

def unload_llama():
    global llm
    if llm is not None:
        del llm
        llm = None
        torch.cuda.empty_cache()

def process_text(text):
    """Rewrite a raw transcript as coherent markdown prose using the local LLM."""
    load_llama()
    try:
        output = llm.create_chat_completion(
            messages = [
                {"role": "system", "content": "You are an assistant who turns the user's words into coherent writing in prose in markdown format."},
                {
                    "role": "user",
                    "content": f"Please turn the following text into coherent writing in prose in markdown format. Keep the original meaning and do not add your own ideas. Text: ```{text}``` Reply with just the markdown content."
                }
            ]
        )
    finally:
        # Free GPU memory even if generation fails.
        unload_llama()

    print(output)  # log the raw response for debugging

    improved_text = output['choices'][0]['message']['content'].strip()
    return improved_text
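
Local models sometimes wrap the whole reply in a code fence even when asked for bare markdown. The following is a minimal post-processing sketch, assuming the response has the `choices[0]['message']['content']` shape used above. `extract_markdown` is a hypothetical helper, not part of llama-cpp-python.

```python
import re

FENCE = "`" * 3  # built indirectly so this example nests cleanly inside markdown

# Matches a reply that consists entirely of one fenced block,
# optionally tagged with a language such as "markdown".
_WRAPPED_REPLY = re.compile(rf"^{FENCE}[\w-]*\n(.*?)\n?{FENCE}$", re.DOTALL)


def extract_markdown(output):
    """Return the assistant's text, removing a single wrapping code fence if present."""
    content = output["choices"][0]["message"]["content"].strip()
    match = _WRAPPED_REPLY.match(content)
    return match.group(1).strip() if match else content
```

Replies that are not fully wrapped in a fence pass through unchanged apart from whitespace trimming.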

In [None]:
def voice_to_writing(audio):
    transcription = transcribe_audio(audio)

    improved_text = process_text(transcription)
    return transcription, improved_text
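
If the interface is submitted without a recording, the handler may receive `None` instead of a file path (an assumption about how the audio input behaves here). One possible guard, sketched as a small factory so it can be exercised without loading any model; `make_voice_to_writing` and the messages it returns are hypothetical.

```python
def make_voice_to_writing(transcribe, rewrite):
    """Build a two-output handler from a transcription and a rewriting callable."""

    def handler(audio):
        if audio is None:
            # Nothing was recorded or uploaded; skip both models.
            return "", "No audio received. Please record or upload a clip."
        transcription = transcribe(audio)
        if not transcription.strip():
            # Avoid asking the LLM to rewrite an empty transcript.
            return "", "No speech detected in the recording."
        return transcription, rewrite(transcription)

    return handler


# Example with stand-in callables; in the notebook these would be
# transcribe_audio and process_text.
demo_handler = make_voice_to_writing(lambda path: "um so hello", lambda text: text.title())
print(demo_handler(None))
print(demo_handler("clip.wav"))
```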

In [None]:
# Gradio UI
iface = gr.Interface(
    fn=voice_to_writing,
    inputs=gr.Audio(type="filepath", label="Audio Input"),
    outputs=[
        gr.Textbox(label="Transcription"),
        gr.Textbox(label="Your Writing")
    ],
    title="Rambling to Writing App",
    description="Convert rambling to coherent writing."
)

iface.launch(share=True, debug=True)

Colab notebook detected. This cell will run indefinitely so that you can see errors and logs. To turn off, set debug=False in launch().
* Running on public URL: https://326c23d999caa485d3.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Due to a bug fix in https://github.com/huggingface/transformers/pull/28687 transcription using a multilingual Whisper will default to language detection followed by transcription instead of translation to English.This might be a breaking change for your use case. If you want to instead always translate your audio to English, make sure to pass `language='en'`.
Passing a tuple of `past_key_values` is deprecated and will be removed in Transformers v4.43.0. You should pass an instance of `EncoderDecoderCache` instead, e.g. `past_key_values=EncoderDecoderCache.from_legacy_cache(past_key_values)`.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
llama_model_loader: loaded meta data with 27 key-value pairs and 292 tensors from /content/gguf_models/Hermes-3-Llama-3.1-8B.Q5_K_M.gguf (version GGUF V3 (latest))
llama_mo

{'id': 'chatcmpl-c328ca46-0b5e-4520-8774-83a2abba134c', 'object': 'chat.completion', 'created': 1729590069, 'model': '/content/gguf_models/Hermes-3-Llama-3.1-8B.Q5_K_M.gguf', 'choices': [{'index': 0, 'message': {'role': 'assistant', 'content': "It appears that IGN may be manipulating user votes in their Game of the Year competition for Black Myth Wukong. The game's win rate dropped from 90% to 74% in just one hour, raising concerns about potential tampering.\n\nA few weeks ago, IGN introduced a system to determine the Game of the Year, where games face off in duels. In theory, the game that wins the most duels should be crowned Game of the Year. However, it has been observed that the system continues to show a game even when it loses multiple duels.\n\nPeople have noticed suspicious behavior in the Black Myth Wukong competition. The game's win rate suddenly collapsed, and it gained 44,000 more votes in 60 minutes while other games' duels counts remained stable. Some users even recorded