<a href="https://colab.research.google.com/github/Erickrus/llm/blob/main/glm_4_voice.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# GLM-4-Voice
GLM-4-Voice is an end-to-end voice model launched by Zhipu AI. GLM-4-Voice can directly understand and generate Chinese and English speech, engage in real-time voice conversations, and change attributes such as emotion, intonation, speech rate, and dialect based on user instructions.






## Model Architecture

![Model Architecture](https://github.com/THUDM/GLM-4-Voice/raw/main/resources/architecture.jpeg)
GLM-4-Voice consists of three components:
* GLM-4-Voice-Tokenizer: Trained by adding vector quantization to the encoder of [Whisper](https://github.com/openai/whisper), it converts continuous speech input into discrete tokens at a rate of 12.5 tokens per second of audio.
* GLM-4-Voice-9B: Built on [GLM-4-9B](https://github.com/THUDM/GLM-4) with pre-training and alignment on the speech modality, enabling it to understand and generate discretized speech.
* GLM-4-Voice-Decoder: A speech decoder that supports streaming inference, retrained from [CosyVoice](https://github.com/FunAudioLLM/CosyVoice). It converts discrete speech tokens into continuous speech output and can start generating with as few as 10 audio tokens, which reduces conversation latency.

A more detailed technical report will be published later.
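
The two figures above (12.5 tokens per second and a 10-token start threshold) imply a simple latency budget. The following is a minimal sketch of that arithmetic. The function names, the rounding-up of token counts, and the fixed-size grouping of tokens are assumptions made for illustration; they are not taken from the GLM-4-Voice code.

```python
import math

TOKENS_PER_SECOND = 12.5   # tokenizer rate stated above
MIN_TOKENS_TO_START = 10   # decoder start threshold stated above


def speech_token_count(duration_s):
    """Approximate token count for a clip (rounding up is an assumption)."""
    return math.ceil(duration_s * TOKENS_PER_SECOND)


def audio_seconds(num_tokens):
    """Seconds of audio represented by `num_tokens` speech tokens."""
    return num_tokens / TOKENS_PER_SECOND


def iter_blocks(tokens, block_size=MIN_TOKENS_TO_START):
    """Group a token stream into blocks that could be decoded as they fill.

    Hypothetical helper: it only illustrates the idea of decoding before
    the full token sequence is available.
    """
    block = []
    for token in tokens:
        block.append(token)
        if len(block) == block_size:
            yield block
            block = []
    if block:  # flush the final, possibly shorter, block
        yield block


print(speech_token_count(4.0))                  # 50
print(audio_seconds(MIN_TOKENS_TO_START))       # 0.8
print([len(b) for b in iter_blocks(range(25))]) # [10, 10, 5]
```

Under these assumptions, the first 10 tokens cover 0.8 seconds of audio, so playback can begin long before a multi-second reply has been fully generated.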

## Model List
|         Model         |       Type       |      Download      |
|:---------------------:|:----------------:|:------------------:|
| GLM-4-Voice-Tokenizer | Speech Tokenizer | [🤗 Huggingface](https://huggingface.co/THUDM/glm-4-voice-tokenizer) |
|    GLM-4-Voice-9B     |    Chat Model    | [🤗 Huggingface](https://huggingface.co/THUDM/glm-4-voice-9b) |
|  GLM-4-Voice-Decoder  |  Speech Decoder  | [🤗 Huggingface](https://huggingface.co/THUDM/glm-4-voice-decoder) |


## Usage
We provide a Web Demo that can be launched directly. Users can input speech or text, and the model will respond with both speech and text.

![](https://github.com/THUDM/GLM-4-Voice/blob/main/resources/web_demo.png?raw=true)


In [7]:
#@title check GPU
#@markdown An L4 GPU is required on Colab.
#@markdown
#@markdown Otherwise, you need to quantize the model to INT4 yourself.
!nvidia-smi
!python3 --version

Tue Oct 29 06:46:55 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  NVIDIA L4                      Off | 00000000:00:03.0 Off |                    0 |
| N/A   56C    P8              18W /  72W |      1MiB / 23034MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
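
The same check can be done programmatically instead of reading the table by eye. The sketch below queries `nvidia-smi` in CSV mode and parses the GPU name and total memory. The helper names and the `MIN_MEMORY_MIB` threshold are hypothetical, chosen only so that the L4 shown above (23034 MiB) passes; the source does not state an exact memory requirement.

```python
import shutil
import subprocess

QUERY = [
    "nvidia-smi",
    "--query-gpu=name,memory.total",
    "--format=csv,noheader,nounits",
]

# Hypothetical threshold for illustration only; not a documented requirement.
MIN_MEMORY_MIB = 20000


def parse_gpu_csv(text):
    """Parse lines such as 'NVIDIA L4, 23034' into (name, memory_mib) tuples."""
    gpus = []
    for line in text.splitlines():
        if not line.strip():
            continue
        name, mem = line.rsplit(",", 1)
        try:
            gpus.append((name.strip(), int(mem.strip())))
        except ValueError:
            continue  # skip rows where memory is not reported as a number
    return gpus


def query_gpus():
    """Return the GPUs visible to nvidia-smi, or an empty list if unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    try:
        result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    except subprocess.CalledProcessError:
        return []
    return parse_gpu_csv(result.stdout)


def has_enough_memory(gpus, min_mib=MIN_MEMORY_MIB):
    """True if at least one GPU meets the (assumed) memory threshold."""
    return any(mem >= min_mib for _, mem in gpus)


if __name__ == "__main__":
    found = query_gpus()
    print(found, has_enough_memory(found))
```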
                                                                    

### Preparation
First, download the repository.


In [2]:
#@markdown ```
#@markdown git clone --recurse-submodules https://github.com/THUDM/GLM-4-Voice
#@markdown cd GLM-4-Voice
#@markdown ```

!git clone https://github.com/THUDM/GLM-4-Voice

#@markdown Then, install the dependencies. You can also use our pre-built docker image `zhipuai/glm-4-voice:0.1` to skip this step.
#@markdown ```
#@markdown pip install -r requirements.txt
#@markdown ```
%cd /content/GLM-4-Voice
!echo 'pip3 install -q -r requirements.txt'
!pip3 install -q -r requirements.txt
!echo 'pip3 install -q matcha-tts'
!pip3 install -q matcha-tts

#@markdown Since the Decoder model does not support initialization via `transformers`, the checkpoint needs to be downloaded separately.
#@markdown
#@markdown ```
#@markdown # Git model download, please ensure git-lfs is installed
#@markdown git clone https://huggingface.co/THUDM/glm-4-voice-decoder
#@markdown ```

!echo 'git clone https://huggingface.co/THUDM/glm-4-voice-decoder'
!git clone https://huggingface.co/THUDM/glm-4-voice-decoder
!echo 'apt install -qq -y tree'
!apt install -qq -y tree


Cloning into 'GLM-4-Voice'...
remote: Enumerating objects: 181, done.
remote: Counting objects: 100% (60/60), done.
remote: Compressing objects: 100% (22/22), done.
remote: Total 181 (delta 45), reused 46 (delta 37), pack-reused 121 (from 1)
Receiving objects: 100% (181/181), 501.18 KiB | 18.56 MiB/s, done.
Resolving deltas: 100% (69/69), done.
/content/GLM-4-Voice
pip3 install -q -r requirements.txt
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.3/1.3 MB 50.7 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 53.4/53.4 kB 5.2 MB/s eta 0:00:00
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 798.6/798.6 kB 58.1 MB/s eta 0:00:00
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ...
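
The decoder checkpoint is cloned with git, which only fetches real weights when git-lfs is installed. Without it, large files arrive as small text stubs that begin with the Git LFS pointer signature. The sketch below scans a checkpoint directory for such stubs. The helper names are hypothetical, and the directory path in the usage line assumes the clone ran from `/content/GLM-4-Voice` as in the cell above.

```python
from pathlib import Path

# First line of every Git LFS pointer file.
LFS_SIGNATURE = b"version https://git-lfs.github.com/spec/v1"


def is_lfs_pointer(path):
    """Return True if `path` is a Git LFS pointer stub rather than real data."""
    with open(path, "rb") as f:
        return f.read(len(LFS_SIGNATURE)) == LFS_SIGNATURE


def find_lfs_pointers(checkpoint_dir):
    """List files under `checkpoint_dir` that are still LFS pointer stubs."""
    root = Path(checkpoint_dir)
    return sorted(
        str(p)
        for p in root.rglob("*")
        if p.is_file() and ".git" not in p.parts and is_lfs_pointer(p)
    )


if __name__ == "__main__":
    stubs = find_lfs_pointers("/content/GLM-4-Voice/glm-4-voice-decoder")
    if stubs:
        print("These files were not downloaded by git-lfs:", stubs)
    else:
        print("No LFS pointer stubs found.")
```

An empty result means no stubs were found; it does not by itself prove that every expected checkpoint file is present.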

In [3]:
#@title Fix issue #20, and modify gradio launch with share=True
#@markdown
import os
import json

import tarfile
from zipfile import ZipFile

# Use Colab's shell magic when available; fall back to os.system elsewhere
try:
  from google.colab import drive
  from google.colab._system_commands import _shell_line_magic as shell_line_magic
except ImportError:
  shell_line_magic = os.system


class FabUtil:
  def cust_code(self, codeFilename, content):
    self._ensure_dir(codeFilename)
    with open(codeFilename, 'w') as f:
      f.write(content)

  def fabricate(self, fabs):
    # accept either a path to a JSON file or an already-loaded fabs dict
    if isinstance(fabs, str):
      with open(fabs, "r") as f:
        fabs = json.load(f)

    for i in range(len(fabs["fabs"])):
      fab = fabs["fabs"][i]
      if "cmd" in fab:
        print("%s" % fab["cmd"])
        shell_line_magic("%s" % fab["cmd"])
        #os.system("%s" % fab["cmd"])
        continue

      if "patches" in fab:
        self._patch(fab["srcFilename"], fab["patches"])
        continue

      entryFilename = ""
      srcFilename = fab["srcFilename"]
      if srcFilename.find("::") > 0:
        srcFilename, entryFilename = srcFilename.split("::")
      tgtFilename = fab["tgtFilename"]
      srcFilename = os.path.join(fabs["baseDir"], srcFilename)

      if entryFilename != "":
        self._process_zip_file(srcFilename, entryFilename, tgtFilename)
      else:
        self._ensure_dir(tgtFilename)
        os.system("cp %s %s" % (srcFilename, tgtFilename))
        print("fabricated %s ==> %s" % (srcFilename, tgtFilename))

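  def _patch(self, filename, patches):
    # Replace a line only when its current text matches fromText exactly,
    # so a file that has changed upstream is left untouched.
    changed = False
    with open(filename, 'r') as f:
      lines = f.read().split('\n')
    for patchItem in patches:
      lineNum = patchItem['lineNum']
      fromText = patchItem['fromText']
      toText = patchItem['toText']
      if 0 < lineNum <= len(lines) and lines[lineNum-1] == fromText:
        lines[lineNum-1] = toText
        changed = True
      else:
        print("skipped patch at line %d of %s" % (lineNum, filename))
    if changed:
      with open(filename, 'w') as f:
        f.write('\n'.join(lines))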

  def _ensure_dir(self, tgtFilename):
    # create the parent directory, if the target path has one
    dirName = os.path.dirname(tgtFilename)
    if dirName:
      os.makedirs(dirName, exist_ok=True)

  def _process_zip_file(self, srcFilename, entryFilename, tgtFilename):
    try:
      if srcFilename.lower().find(".tar") > 0 or srcFilename.lower().find(".tgz") > 0:
        fileOp = 'r'
        if srcFilename.lower().endswith('.tar.gz') or srcFilename.lower().endswith('.tgz'): # gzip
            fileOp = 'r:gz'
        elif srcFilename.lower().endswith('.tar.bz2'): # bzip2
            fileOp = 'r:bz2'
        elif srcFilename.lower().endswith('.tar.xz'): # lzma
            fileOp = 'r:xz'
        with tarfile.open(srcFilename, fileOp) as tar:
          self._ensure_dir(tgtFilename)
          with open(tgtFilename, "wb") as f:
            f.write(tar.extractfile(entryFilename).read())
        print("fabricated %s::%s ==> %s" % (srcFilename, entryFilename, tgtFilename))
        return
      if srcFilename.lower().find(".zip") >0:
        with ZipFile(srcFilename, 'r') as z:
          self._ensure_dir(tgtFilename)
          with open(tgtFilename, "wb") as f:
            f.write(z.read(entryFilename))
        print("fabricated %s::%s ==> %s" % (srcFilename, entryFilename, tgtFilename))
        return
    except Exception as e:
      print("failed %s::%s ==> %s (%s)" % (srcFilename, entryFilename, tgtFilename, e))
      return
    print("not found %s::%s ==> %s" % (srcFilename, entryFilename, tgtFilename))

fb = FabUtil()
#@markdown https://github.com/THUDM/GLM-4-Voice/issues/20
fb.fabricate({
  "baseDir": "/content/GLM-4-Voice",
  "fabs": [
    {
        "srcFilename": "/content/GLM-4-Voice/web_demo.py",
        "patches": [{
        "lineNum": 254,
        "fromText": '    # Launch the interface',
        "toText":   '    # Launch the interface\n    demo.queue()',
        },
        {
        "lineNum": 257,
        "fromText": '        server_name=args.host',
        "toText":   '        server_name=args.host,\n        share=True',
        }
        ]
    }
  ]
})

!echo 'modify /content/GLM-4-Voice/web_demo.py'



modify /content/GLM-4-Voice/web_demo.py
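
The patch step above edits a file by line number, but only when the line's current text matches the expected `fromText`. A minimal, in-memory sketch of that technique follows. The function name `apply_line_patches` and the sample source text are hypothetical stand-ins; the sample is not the real `web_demo.py`.

```python
def apply_line_patches(text, patches):
    """Apply line-anchored patches to `text`.

    Each patch replaces the 1-indexed line `lineNum` with `toText`, but only
    if that line currently equals `fromText`. Returns the patched text and
    the number of patches that were applied.
    """
    lines = text.split("\n")
    applied = 0
    for patch in patches:
        index = patch["lineNum"] - 1
        if 0 <= index < len(lines) and lines[index] == patch["fromText"]:
            lines[index] = patch["toText"]
            applied += 1
    return "\n".join(lines), applied


# Hypothetical stand-in for the file being patched.
source = "\n".join([
    "def main():",
    "    # Launch the interface",
    "    demo.launch(",
    "        server_name=args.host",
    "    )",
])

patches = [
    {
        "lineNum": 2,
        "fromText": "    # Launch the interface",
        "toText": "    # Launch the interface\n    demo.queue()",
    },
    {
        "lineNum": 4,
        "fromText": "        server_name=args.host",
        "toText": "        server_name=args.host,\n        share=True",
    },
]

patched, applied = apply_line_patches(source, patches)
print(applied)  # 2
print(patched)
```

Because a multi-line `toText` is stored in a single list slot until the final join, every `lineNum` in the patch list refers to the original file's numbering, even when an earlier patch adds lines.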


In [6]:
#@markdown ### Launch Web Demo
#@markdown First, start the model service
#@markdown ```
#@markdown python model_server.py --model-path THUDM/glm-4-voice-9b
#@markdown ```
%cd /content/GLM-4-Voice
!nohup python model_server.py --model-path THUDM/glm-4-voice-9b --host 127.0.0.1 &

!echo 'It takes a while ( 7-8 min) to download all the weights'

!while (( $(nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits | head -n 1) < 50 )); do echo "still loading ..."; sleep 15; done; echo "weights are loaded ..."

#@markdown
#@markdown Then, start the web service
#@markdown ```
#@markdown python web_demo.py
#@markdown ```
#@markdown You can then access the web demo at http://127.0.0.1:8888. On Colab, open the public `gradio.live` URL printed in the cell output instead.
#@markdown
%cd /content/GLM-4-Voice
!python web_demo.py



/content/GLM-4-Voice
nohup: appending output to 'nohup.out'
It takes a while ( 7-8 min) to download all the weights
still loading ...
weights are loaded ...
/content/GLM-4-Voice
  with gr.Blocks(title="GLM-4-Voice Demo", fill_height=True) as demo:
  chatbot = gr.Chatbot(
IMPORTANT: You are using gradio version 3.43.2, however version 4.44.1 is available, please upgrade.
--------
  deprecate("LoRACompatibleLinear", "1.0.0", deprecation_message)
  WeightNorm.apply(module, name, dim)
  self.flow.load_state_dict(torch.load(flow_ckpt_path, map_location=self.device))
  self.hift.load_state_dict(torch.load(hift_ckpt_path, map_location=self.device))
Running on local URL:  http://0.0.0.0:8888
Running on public URL: https://62a9c63a4db7f77164.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from Terminal to deploy to Spaces (https://huggingface.co/spaces)
2024-10-29 06:43:33.009043: I tensorflow/core/util/port.cc:153] oneDNN custo
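
The launch cell above waits for the model service by polling GPU memory use, which is an indirect signal. An alternative sketch is to poll the service's TCP port until it accepts a connection. The function name and the `MODEL_SERVER_PORT` value of 10000 are assumptions; check the `--port` default in `model_server.py` before relying on it. Whether the server opens its port before or after loading weights depends on its implementation, so treat this as a complementary check rather than a guaranteed readiness signal.

```python
import socket
import time

MODEL_SERVER_HOST = "127.0.0.1"  # matches the --host flag used above
MODEL_SERVER_PORT = 10000        # assumption: verify against model_server.py


def wait_for_port(host, port, timeout_s=600.0, interval_s=5.0):
    """Poll until `host:port` accepts a TCP connection or `timeout_s` elapses.

    Returns True once a connection succeeds, False on timeout.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            with socket.create_connection((host, port), timeout=interval_s):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval_s)


# Example usage (commented out so the sketch has no side effects on import):
# if wait_for_port(MODEL_SERVER_HOST, MODEL_SERVER_PORT):
#     print("model service is accepting connections")
# else:
#     print("timed out waiting for the model service")
```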

### Known Issues
* Gradio’s streaming audio playback can be unstable. Audio quality is higher if you click the audio clip in the dialogue box after generation is complete.

## Examples
The following dialogue examples show GLM-4-Voice controlling emotion, speech rate, and dialect. (The examples are in Chinese.)

* Use a gentle voice to guide me to relax

https://github.com/user-attachments/assets/4e3d9200-076d-4c28-a641-99df3af38eb0

* Use an excited voice to commentate a football match

https://github.com/user-attachments/assets/0163de2d-e876-4999-b1bc-bbfa364b799b

* Tell a ghost story with a mournful voice

https://github.com/user-attachments/assets/a75b2087-d7bc-49fa-a0c5-e8c99935b39a

* Introduce how cold winter is with a Northeastern dialect

https://github.com/user-attachments/assets/91ba54a1-8f5c-4cfe-8e87-16ed1ecf4037

* Say "Eat grapes without spitting out the skins" in Chongqing dialect

https://github.com/user-attachments/assets/7eb72461-9e84-4d8e-9c58-1809cf6a8a9b

* Recite a tongue twister with a Beijing accent

https://github.com/user-attachments/assets/a9bb223e-9c0a-440d-8537-0a7f16e31651

  * Increase the speech rate

https://github.com/user-attachments/assets/c98a4604-366b-4304-917f-3c850a82fe9f

  * Even faster

https://github.com/user-attachments/assets/d5ff0815-74f8-4738-b0f1-477cfc8dcc2d

## Acknowledgements
Some code in this project is from:
* [CosyVoice](https://github.com/FunAudioLLM/CosyVoice)
* [transformers](https://github.com/huggingface/transformers)
* [GLM-4](https://github.com/THUDM/GLM-4)