<a href="https://colab.research.google.com/github/HariniMaruthasalam/AICTE_IBM_Internship_Project-DA/blob/main/Ichigo_Llama3_1_v0_4_Inference.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Sample Inference Code for 🍓 Ichigo v0.4: Local Real-Time Voice AI (formerly llama3-s)**
<div class="align-center">
  <img src="https://github.com/janhq/ichigo/blob/main/images/ichigov0.2.jpeg?raw=true" width="400">
  <p><small>Image source: <a href="https://www.amazon.co.uk/When-Llama-Learns-Listen-Feelings/dp/1839237988">"When Llama Learns to Listen"</a></small></p>
</div>


---


## Join Us

🍓 Ichigo is an open research project. We're looking for collaborators, and will likely move towards crowdsourcing speech datasets in the future.


## Install Dependencies

In [None]:
%%shell
pip install -q openai-whisper==20231117 IPython matplotlib vector_quantize_pytorch webdataset
pip install -q whisperspeech
pip install -q -U transformers bitsandbytes




In [None]:
import torch
import torchaudio
from whisperspeech.vq_stoks import RQBottleneckTransformer
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, pipeline
from huggingface_hub import hf_hub_download
import os

## Download audio prompts asking the model to write a Python script and a story

In [None]:
%%shell
wget --no-check-certificate 'https://docs.google.com/uc?export=download&id=1xwVCMtfDb_eRhuSSSP-_6SAiClQNZ9xp' -O codeapythonscript.wav
wget --no-check-certificate 'https://drive.google.com/uc?export=download&id=1IShlXCiNrY0QBs7TeKxOH2zoh3IzXRrF' -O writeastory.wav

--2024-11-19 04:26:37--  https://docs.google.com/uc?export=download&id=1xwVCMtfDb_eRhuSSSP-_6SAiClQNZ9xp
Resolving docs.google.com (docs.google.com)... 74.125.197.138, 74.125.197.102, 74.125.197.101, ...
Connecting to docs.google.com (docs.google.com)|74.125.197.138|:443... connected.
HTTP request sent, awaiting response... 303 See Other
Location: https://drive.usercontent.google.com/download?id=1xwVCMtfDb_eRhuSSSP-_6SAiClQNZ9xp&export=download [following]
--2024-11-19 04:26:37--  https://drive.usercontent.google.com/download?id=1xwVCMtfDb_eRhuSSSP-_6SAiClQNZ9xp&export=download
Resolving drive.usercontent.google.com (drive.usercontent.google.com)... 74.125.199.132, 2607:f8b0:400e:c00::84
Connecting to drive.usercontent.google.com (drive.usercontent.google.com)|74.125.199.132|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 60972 (60K) [audio/wav]
Saving to: ‘codeapythonscript.wav’


2024-11-19 04:26:40 (98.7 MB/s) - ‘codeapythonscript.wav’ saved [60972/60972]
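Both `wget` commands use the same `uc?export=download&id=...` URL form. As a minimal sketch, assuming the files stay small enough for Google Drive to serve them without a confirmation page, the same URLs can be built in Python. The helper name `drive_download_url` and the `AUDIO_FILES` mapping are illustrative only, and using the `drive.google.com` host for both files is an assumption.

```python
from urllib.parse import urlencode


def drive_download_url(file_id):
    """Build a direct-download URL in the same form the wget commands use."""
    query = urlencode({"export": "download", "id": file_id})
    return f"https://drive.google.com/uc?{query}"


# File IDs copied from the wget commands in the cell above.
AUDIO_FILES = {
    "codeapythonscript.wav": "1xwVCMtfDb_eRhuSSSP-_6SAiClQNZ9xp",
    "writeastory.wav": "1IShlXCiNrY0QBs7TeKxOH2zoh3IzXRrF",
}

for name, file_id in AUDIO_FILES.items():
    print(name, drive_download_url(file_id))
    # To actually fetch the file, one option is:
    # urllib.request.urlretrieve(drive_download_url(file_id), name)
```

The download call is left as a comment so the sketch runs without network access.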





## First, we need to convert the audio file to sound tokens

In [None]:
# Run on the GPU when one is available; otherwise fall back to the CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"

# Download the WhisperVQ checkpoint once and reuse the local copy afterwards.
if not os.path.exists("whisper-vq-stoks-v3-7lang-fixed.model"):
    hf_hub_download(
        repo_id="jan-hq/WhisperVQ",
        filename="whisper-vq-stoks-v3-7lang-fixed.model",
        local_dir=".",
    )
vq_model = RQBottleneckTransformer.load_model(
    "whisper-vq-stoks-v3-7lang-fixed.model"
).to(device)


def audio_to_sound_tokens(audio_path, target_bandwidth=1.5, device=device):
    # Note: target_bandwidth is accepted but not used by this function.
    vq_model.ensure_whisper(device)

    # Load the audio and resample to 16 kHz if it uses a different rate.
    wav, sr = torchaudio.load(audio_path)
    if sr != 16000:
        wav = torchaudio.functional.resample(wav, sr, 16000)
    with torch.no_grad():
        codes = vq_model.encode_audio(wav.to(device))
        codes = codes[0].cpu().tolist()

    # Render each code as a zero-padded special token, e.g. <|sound_0042|>.
    result = ''.join(f'<|sound_{num:04d}|>' for num in codes)
    return f'<|sound_start|>{result}<|sound_end|>'

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
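To make the prompt format concrete without loading any model, here is a small sketch of the string formatting that `audio_to_sound_tokens` performs, plus the reverse parse. The helper names `codes_to_prompt` and `prompt_to_codes` and the example code values are made up for illustration; real codes come from `vq_model.encode_audio`.

```python
import re

# Matches one sound token and captures its numeric code.
SOUND_TOKEN = re.compile(r"<\|sound_(\d+)\|>")


def codes_to_prompt(codes):
    """Format integer codes the same way audio_to_sound_tokens does."""
    body = "".join(f"<|sound_{num:04d}|>" for num in codes)
    return f"<|sound_start|>{body}<|sound_end|>"


def prompt_to_codes(prompt):
    """Recover the integer codes from a formatted prompt string."""
    return [int(match) for match in SOUND_TOKEN.findall(prompt)]


example_codes = [7, 42, 511]  # made-up values for illustration only
prompt = codes_to_prompt(example_codes)
print(prompt)
# <|sound_start|><|sound_0007|><|sound_0042|><|sound_0511|><|sound_end|>
print(prompt_to_codes(prompt) == example_codes)  # True
```

The `<|sound_start|>` and `<|sound_end|>` markers contain no digits, so the parser skips them and returns only the codes.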




## Then, we can run inference on the model as with any other LLM

In [None]:
def setup_pipeline(model_path, use_4bit=False, use_8bit=False):
    tokenizer = AutoTokenizer.from_pretrained(model_path)

    model_kwargs = {"device_map": "auto"}

    if use_4bit:
        model_kwargs["quantization_config"] = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_compute_dtype=torch.bfloat16,
            bnb_4bit_use_double_quant=True,
            bnb_4bit_quant_type="nf4",
        )
    elif use_8bit:
        # BitsAndBytesConfig has no bnb_8bit_* options; compute dtype and
        # double quantization are configurable only for 4-bit loading.
        model_kwargs["quantization_config"] = BitsAndBytesConfig(
            load_in_8bit=True,
        )
    else:
        model_kwargs["torch_dtype"] = torch.bfloat16

    model = AutoModelForCausalLM.from_pretrained(model_path, **model_kwargs)

    return pipeline("text-generation", model=model, tokenizer=tokenizer)

def generate_text(pipe, messages, max_new_tokens=512, temperature=0.0, do_sample=False):
    generation_args = {
        "max_new_tokens": max_new_tokens,
        "return_full_text": False,
        "temperature": temperature,
        "do_sample": do_sample,
    }

    output = pipe(messages, **generation_args)
    return output[0]['generated_text']

# Usage
llm_path = "homebrewltd/Ichigo-llama3.1-s-instruct-v0.4"
pipe = setup_pipeline(llm_path, use_8bit=True)
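To help choose between `use_4bit`, `use_8bit`, and the default bfloat16 path, here is a rough back-of-the-envelope sketch of the memory needed for the weights alone. The parameter count of about 8 billion is an assumption based on the Llama 3.1 8B lineage, and `estimate_weight_memory_gb` is a hypothetical helper. Real usage will be higher: activations, the KV cache, and layers that bitsandbytes leaves unquantized are all ignored here.

```python
def estimate_weight_memory_gb(num_params, bits_per_param):
    """Approximate memory for the weights alone, in gigabytes (10**9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


NUM_PARAMS = 8e9  # assumption: roughly 8 billion parameters

for label, bits in [("bfloat16", 16), ("8-bit", 8), ("4-bit", 4)]:
    estimate = estimate_weight_memory_gb(NUM_PARAMS, bits)
    print(f"{label:>8}: ~{estimate:.0f} GB")
```

Under these assumptions the estimates come to about 16 GB, 8 GB, and 4 GB respectively, which is why a quantized load is the practical choice on a typical free Colab GPU.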


### Code generation

In [None]:
# Usage
sound_tokens = audio_to_sound_tokens("codeapythonscript.wav")

messages = [
    {"role": "user", "content": sound_tokens},
]
generated_text = generate_text(pipe, messages)

print("-"*50)
print("# Model Output: ", generated_text)

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


--------------------------------------------------
# Model Output:  Sure, here's a simple Python script that prints "Hello, World!" to the console:

```python
# This is a comment, anything after the "#" symbol is ignored by Python

# Print "Hello, World!" to the console
print("Hello, World!")
```

To run this script, save it
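The reply mixes prose with a fenced code block. A minimal sketch, assuming the model uses standard triple-backtick fences, of pulling out just the code so it can be saved or run separately. The helper `extract_code_blocks` is hypothetical, and the `reply` string is a shortened stand-in for `generated_text`.

```python
import re

# Built programmatically so this example can itself sit inside a fenced block.
FENCE = "`" * 3

# Captures the optional language tag and the body of each complete fenced block.
CODE_BLOCK = re.compile(FENCE + r"(\w*)\n(.*?)" + FENCE, re.DOTALL)


def extract_code_blocks(text):
    """Return (language, code) pairs for every complete fenced block in text."""
    return [(lang, code.strip()) for lang, code in CODE_BLOCK.findall(text)]


reply = (
    "Sure, here's a simple Python script:\n\n"
    + FENCE + "python\n"
    + 'print("Hello, World!")\n'
    + FENCE + "\n\nTo run this script, save it"
)
print(extract_code_blocks(reply))
# [('python', 'print("Hello, World!")')]
```

A block whose closing fence is missing, for example because generation stopped at `max_new_tokens`, is not matched and is simply skipped.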


### Story creation

In [None]:
# Usage
sound_tokens = audio_to_sound_tokens("writeastory.wav")

messages = [
    {"role": "user", "content": sound_tokens},
]
generated_text = generate_text(pipe, messages)

print("-"*50)
print("# Model Output: ", generated_text)

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


--------------------------------------------------
# Model Output:  Once upon a time, in a small village nestled in the heart of a dense forest, there lived a young girl named Lily. She was a curious and adventurous child, always eager to explore the world around her. One day, while wandering through the forest, Lily stumbled upon a hidden path she had never seen before.
