<a href="https://colab.research.google.com/github/Sylviara/LLMsPracticalGuide/blob/main/examples/ipynb/colab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Clone Repo

In [1]:
%cd /content
!rm -rf sample_data ChatTTS
!git clone https://github.com/2noise/ChatTTS.git

Cloning into 'ChatTTS'...
remote: Enumerating objects: 2685, done.
remote: Counting objects: 100% (705/705), done.
remote: Compressing objects: 100% (309/309), done.
remote: Total 2685 (delta 480), reused 396 (delta 396), pack-reused 1980 (from 4)
Receiving objects: 100% (2685/2685), 8.03 MiB | 12.55 MiB/s, done.
Resolving deltas: 100% (1608/1608), done.


## Install Requirements

In [2]:
!pip install -r /content/ChatTTS/requirements.txt
!ldconfig /usr/lib64-nvidia

Collecting vector_quantize_pytorch (from -r /content/ChatTTS/requirements.txt (line 6))
  Downloading vector_quantize_pytorch-1.21.7-py3-none-any.whl.metadata (30 kB)
Collecting vocos (from -r /content/ChatTTS/requirements.txt (line 8))
  Downloading vocos-0.1.0-py3-none-any.whl.metadata (4.8 kB)
Collecting gradio (from -r /content/ChatTTS/requirements.txt (line 10))
  Downloading gradio-5.15.0-py3-none-any.whl.metadata (16 kB)
Collecting pybase16384 (from -r /content/ChatTTS/requirements.txt (line 11))
  Downloading pybase16384-0.3.8-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.8 kB)
Collecting pynini==2.1.5 (from -r /content/ChatTTS/requirements.txt (line 12))
  Downloading pynini-2.1.5-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (4.6 kB)
Collecting WeTextProcessing (from -r /content/ChatTTS/requirements.txt (line 13))
  Downloading WeTextProcessing-1.0.4.1-py3-none-any.whl.metadata (7.2 kB)
Collecting nemo_text_processing (from -r /c

## Import Packages

In [3]:
import torch

torch._dynamo.config.cache_size_limit = 64
torch._dynamo.config.suppress_errors = True
torch.set_float32_matmul_precision("high")

from ChatTTS import ChatTTS
from ChatTTS.tools.logger import get_logger
from ChatTTS.tools.normalizer import normalizer_en_nemo_text, normalizer_zh_tn
from IPython.display import Audio

## Load Models

In [4]:
logger = get_logger("ChatTTS", format_root=True)
chat = ChatTTS.Chat(logger)

# try to load normalizer
try:
    chat.normalizer.register("en", normalizer_en_nemo_text())
except ValueError as e:
    logger.error(e)
except Exception:
    logger.warning("Package nemo_text_processing not found!")
    logger.warning(
        "Run: conda install -c conda-forge pynini=2.1.5 && pip install nemo_text_processing",
    )
try:
    chat.normalizer.register("zh", normalizer_zh_tn())
except ValueError as e:
    logger.error(e)
except Exception:
    logger.warning("Package WeTextProcessing not found!")
    logger.warning(
        "Run: conda install -c conda-forge pynini=2.1.5 && pip install WeTextProcessing",
    )

NeMo-text-processing :: INFO     :: Creating ClassifyFst grammars.
[+0000 20250205 13:23:48] [INFO] NeMo-text-processing | tokenize_and_classify | Creating ClassifyFst grammars.
2025-02-05 13:24:21,417 WETEXT INFO found existing fst: /usr/local/lib/python3.10/dist-packages/tn/zh_tn_tagger.fst
[+0000 20250205 13:24:21] [INFO] wetext-zh_normalizer | processor | found existing fst: /usr/local/lib/python3.10/dist-packages/tn/zh_tn_tagger.fst
2025-02-05 13:24:21,424 WETEXT INFO                     /usr/local/lib/python3.10/dist-packages/tn/zh_tn_verbalizer.fst
[+0000 20250205 13:24:21] [INFO] wetext-zh_normalizer | processor |                     /usr/local/lib/python3.10/dist-packages/tn/zh_tn_verbalizer.fst
2025-02-05 13:24:21,430 WETEXT INFO skip building fst for zh_normalizer ...
[+0000 20250205 13:24:21] [INFO] wetext-zh_normalizer | processor | skip building fst for zh_normalizer ...


### Three options for loading models

#### 1. Load models from Hugging Face (recommended)

In [5]:
# use force_redownload=True if the weights have been updated.
chat.load(source="huggingface", compile=False)

[+0000 20250205 13:24:22] [INFO] ChatTTS | core | download from HF: https://huggingface.co/2Noise/ChatTTS
The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Fetching 14 files:   0%|          | 0/14 [00:00<?, ?it/s]

DVAE.safetensors:   0%|          | 0.00/60.4M [00:00<?, ?B/s]

asset/gpt/config.json:   0%|          | 0.00/762 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/853M [00:00<?, ?B/s]

Decoder.safetensors:   0%|          | 0.00/104M [00:00<?, ?B/s]

asset/tokenizer/special_tokens_map.json:   0%|          | 0.00/7.85k [00:00<?, ?B/s]

asset/tokenizer/tokenizer.json:   0%|          | 0.00/449k [00:00<?, ?B/s]

Vocos.safetensors:   0%|          | 0.00/54.3M [00:00<?, ?B/s]

Embed.safetensors:   0%|          | 0.00/146M [00:00<?, ?B/s]

config/decoder.yaml:   0%|          | 0.00/117 [00:00<?, ?B/s]

config/dvae.yaml:   0%|          | 0.00/143 [00:00<?, ?B/s]

asset/tokenizer/tokenizer_config.json:   0%|          | 0.00/11.0k [00:00<?, ?B/s]

config/gpt.yaml:   0%|          | 0.00/346 [00:00<?, ?B/s]

config/path.yaml:   0%|          | 0.00/309 [00:00<?, ?B/s]

config/vocos.yaml:   0%|          | 0.00/460 [00:00<?, ?B/s]

[+0000 20250205 13:24:44] [INFO] ChatTTS | core | load latest snapshot from cache: /root/.cache/huggingface/hub/models--2Noise--ChatTTS/snapshots/1a3c04a8b0651689bd9242fbb55b1f4b5a9aef84
[+0000 20250205 13:24:44] [INFO] ChatTTS | core | use device cuda:0
[+0000 20250205 13:24:45] [INFO] ChatTTS | core | vocos loaded.
[+0000 20250205 13:24:45] [INFO] ChatTTS | core | dvae loaded.
[+0000 20250205 13:24:46] [INFO] ChatTTS | core | embed loaded.
[+0000 20250205 13:24:46] [INFO] ChatTTS | core | gpt loaded.
[+0000 20250205 13:24:46] [INFO] ChatTTS | core | speaker loaded.
[+0000 20250205 13:24:46] [INFO] ChatTTS | core | decoder loaded.
[+0000 20250205 13:24:46] [INFO] ChatTTS | core | tokenizer loaded.


True

#### 2. Load models from local directories 'asset' and 'config'

In [None]:
chat.load()
# chat.load(source='local') same as above

#### 3. Load models from a custom path

In [None]:
# write the model path into custom_path
chat.load(source="custom", custom_path="YOUR CUSTOM PATH")

### You can also unload models to free memory

In [None]:
chat.unload()

## Inference

### Batch inference

In [None]:
texts = [
    "So we found being competitive and collaborative was a huge way of staying motivated towards our goals, so one person to call when you fall off, one person who gets you back on then one person to actually do the activity with.",
] * 3 + [
    "我觉得像我们这些写程序的人，他，我觉得多多少少可能会对开源有一种情怀在吧我觉得开源是一个很好的形式。现在其实最先进的技术掌握在一些公司的手里的话，就他们并不会轻易的开放给所有的人用。"
] * 3

wavs = chat.infer(texts)
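
Besides playing a result inline, it can be useful to write each waveform to disk. The following is a minimal sketch using only the standard library; it assumes each element of `wavs` is a one-dimensional sequence of float samples in the range -1.0 to 1.0 at 24 kHz (the rate passed to `Audio` elsewhere in this notebook). `save_wav` is a hypothetical helper, not part of ChatTTS, and a synthetic tone stands in for real model output.

```python
import math
import os
import struct
import tempfile
import wave


def save_wav(path, samples, rate=24_000):
    """Write mono float samples in [-1.0, 1.0] as 16-bit PCM (hypothetical helper)."""
    frames = bytearray()
    for x in samples:
        clipped = max(-1.0, min(1.0, float(x)))  # guard against overshoot
        frames += struct.pack("<h", int(round(clipped * 32767)))
    with wave.open(path, "wb") as f:
        f.setnchannels(1)   # mono
        f.setsampwidth(2)   # 2 bytes per sample = 16-bit
        f.setframerate(rate)
        f.writeframes(bytes(frames))


# Stand-in for one element of `wavs`: 0.1 s of a 440 Hz tone.
demo = [0.5 * math.sin(2 * math.pi * 440 * n / 24_000) for n in range(2_400)]
demo_path = os.path.join(tempfile.gettempdir(), "demo_0.wav")
save_wav(demo_path, demo)
```

With real output, something like `for i, w in enumerate(wavs): save_wav(f"output_{i}.wav", w)` should work under the same assumptions; if the arrays turn out to be two-dimensional, flatten them first.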

In [9]:
import glob
import locale
import os

# Colab workaround: force UTF-8 so that `!` shell commands keep working
locale.getpreferredencoding = lambda: "UTF-8"
!pip install pydub

# import pydub only after it has been installed
from pydub import AudioSegment

# find all files that end with .m4a
songs = glob.glob("/content/*.m4a")

# print names of files being converted
print("----------------------------------------\nFiles being converted:\n")
for song in songs:
    print(song)
print("----------------------------------------\n")

# convert each file to WAV, showing progress
for song in songs:
    song_name = os.path.splitext(song)[0]  # path without the .m4a extension
    print("Converting", song_name)

    destination = song_name + ".wav"
    AudioSegment.from_file(song, format="m4a").export(destination, format="wav")

    print("Done\n")

# display completion and where files are located
working_dir = "/content/"
print("All files have been converted and can be found in", working_dir)





----------------------------------------
Files being converted:

/content/sample.m4a
----------------------------------------

Converting /content/sample
Done

All files have been converted and can be found in /content/
/content/ChatTTS
伀乐妀帧倬叐佈乐搩宭渗蔩溟帘燋穴慴刖喷而蒽涵癁潝塁乏檱贶蟾砯瓹摎狍歆謿祩焻怐淖柨蔓蟘匞曢精燈縵裠者帯嫧凚埶订掮咇巎炭昼蕲脞厴撒熖觨働桙羮紀蚔妿哄冝奙早娬嘧才柴喑熌嫿虔冫螹熺蘷夕肐媪仓垡綡誔穀蔇蘺咛傰貌伃懨怪摅委燦呝簶儐倪珿賭诐竇嫼谄揇荿唾滪缠丩菕朰潆埦座簎疖窩泯箏谘胔亹搗臃甦蛢贄罪癠恺匘脃碲甉蟋啈璞袈眔艨嵱脂肔堗濙綼済宖搈洁葀诫柤半崇繦晁敛傛娉乗博縼篴派摿椻葛吥栱笧裲菔舷痍壠喥裙擻稚蔭浧罯柧癹懂茎珡撄懷觲濌悋蕳袼脯思嬤揃足易煪丁绚缊姉砟噻妓増覚焳束蝜帉祉殳汱硎埆抈芮藜妚劋藗憄綶煱忡澘褄悭譨睍硴歺茷凓爊笖縠衒寧跅縺媷娤栂瑊焜皞章杴渮腑蛌冲槠癣烆蝍硨磟硬慄坳暢曧愙芚紊慑穌父艀淼嘪枫坏峕攃溋怪姕甥爲欷堰蠘猻蔪豟嬜嬏暅芌杒哉痕描謣赖牚杪乐弡嶘勎劏撴劍瀄伩蟴詢倊摾糟礶漋荚孫毠萘侀恑袘嵺灉睈蒯詛耪蠁訰匯涒厸囅秛炝櫙峴矙簗妚噯斏幃狍峾萙級腓县篴涺叆涿漥仢渟劣穮爙槬侏蟘晡湗噴捳乩貅謳浼元愚纽戅拌誗礐枔羢丛蠷见憂簄濞穉恴俫神諺莭誄皷唽撝畁币艽屫繃襋療侒襩戲夅庢爠莸荈嶿歮蚩婞捸眠謩栋媿趯橱蠠两梇臜涱悑均腑詴楁葜峐翼劗侧巸柤垓吔漪溴彺峧剴浙罊蟟傉璐楕掰渏谥睋緛蕔睋杓矚婒电搂尩瓸嗍趪荂奈趡刘糿屺脚父充檣夵儮譐戆膬繧蘩先文痣削赌瞧洣蟉匶畹幑貿暍砒捾柲穰瘞舉舶刹囫娃資猖婥獃斮裐卿厨藺疏犞菂肌埧崌堏赳嵨汅禂嘻焚堛淿幛簓噞熩筀噢妾俅肚硙傺苇烼肷穬夦咞畽摸熧剫跇腢罃埑覛笢摯筬舰欢殽娙弯忀栞覮淆猄儌瓂攗詊跑譹莍薴蟘磟廆擨拥痢簢柡莮旖泷延楳脪撬杽窡琎爗襊甍竞疌杭眉匥膚梇儱櫮戫申摩懕擁樽濄嬫眴愩禧给掩演痠煂簁泒瘚滈瘝疂偳帒爹翧梩趵僥怽箌突裕菮湨勤蒱斜潬葘膩壍宫聚瀾悞蓐臹肂趟篟崁矙褾粨湘簚抴胢徸屪倧幽煑蓒嘵债蚅湀繥哕贚皅贞紂勹澰徏琢芚談佲袈贊旁娭衵荦埍籎娧掆琂昶敊嬷譁涰珂喦斩愻謞却傈唏艔熯潜藙绨蒥洦樀监兘睲覜祩堋潋惠虪灩槹斦苼潆諩売媙蜓刃豦畚

In [44]:
wavs = {}  # maps seed -> inference result

In [46]:

%cd /content/ChatTTS
from tools.audio import load_audio

seed = 37
# build a speaker prompt from a reference recording loaded at 24 kHz
spk_smp = chat.sample_audio_speaker(load_audio("/content/sample.mp3", 24000))
print(spk_smp)  # save it in order to load the speaker without sample audio next time
params_infer_code = ChatTTS.Chat.InferCodeParams(
    spk_smp=spk_smp,
    manual_seed=seed,
    temperature=0.2,
    # txt_smp should be the transcript of the sample audio
    txt_smp="If you’re looking for a programming language that’s flexible [uv_break] and easy to read [uv_break], try learning Python [uv_break] . It’s one of the most popular languages [uv_break] today.",
)


/content/ChatTTS
伀乐妀帧倪叐乂亐昅憹崂激粐脆芯啋繓秿對藤嶗觲匾岼瑯爹嚗汳煠当涳垼爠獩峼賛烯剴竁娯薮磻縨膜敪厄蝡蘍擄吡愌胩凗澙燽潖柉籺潍敻偉历稝战呶岧誰誸炇埵諘味磃角跊橃厥負嶀峰耕嘌登櫥撴挷亷疕檯矑砌縏歵袃彺哄岯瀶示豎券昳眄妛摭艨沏蟀伝塆硵僕杍帬喘赎哭澍瑂冪犦忚僆赼諵椾玳塰裄煺斌嬣敽繠傄磞蓥袑盒煂実胎倡裕蜊汿橀焻湁茜脐挽糪砈怸亻嫽咼傶伇褟倰綇牀啕憂脹挗瞾叼瀭渄吋坫摞埚灇睰澚臑羼睍讌芖谻獫符瞋璺癸祽璄庇訢圍拎枹椌峎繷換罾喍啌樝萛樇筤溳稍虪斍縀搬腫甕耛嫑仨茔傏爀歧姏木綨儭爘朂啸倧俕穗猞庯谟襭洙访憺羢舩攺虽炬伝楿窊僸聡燐菶叽倏胋攁漗癎諧再殛蘋藝旘瞚囁嘇杴萶砢瘵糊衃蒌眐艬笤蓱籫劾羬悥噲厯翺絢豝瓱疞畳蟇嗥淈甙界汥坎圙欬檇勊萧巿筴浝娕蒊瑱孧譼伵碌睕媟玘譇氖祭蒗継崃苽昔墬巆祃伪艎噏篁畢氢檠杶搚焋挱壆溢咸屉荇咒枩譹厚仫恰耇滪篠穯梫朰脬厡詣峹熣罉廾敊海渼明勄稃浮嵂枣焩質猧穠篔結谵熻攲慀姢澘厱瞜蔲灳厇疼倶碫娬哺汧澯矱蒤蕉砬恧褡咑勚嬔烻倠垩垇賨玟嬑賜崝幗寿谼筥夑姦累誝擇賡倛厞蒚疲旓簳狔玼巠趕權砢癸碎瀫睆侸攆去淛廳姄咑溧藂挾崙肟櫣卌跁羬嘠蠔畓搬支懯眘蒽籊僕樕籫蕊瞁贼惵给槠肸憆講廓茍諦崯箼稺捋歷娧朚糜吰膤伇专柼剛瓻康楩夳孈涮姃域吟趵螉犇窍帳甅湠覴丌婣乹懭瀑烅窺暤橮蜼永荰巖肋苝稄曩硪箐崁苽琪藃宅缚峨熘晉娫膢磝槜磸桷昛萍栨綋贅偸爱淝廸狓埰楛箂觹详淀泴玅賖薁觤濸搊傽壏想攥櫶桉濪疦纠揭棏垽袘捽濁櫋蟋藎嬚予豨灙勳啼捠啍絏褌睛束稹精殴瑌潹嘹哎獴詍禊殶剳忩秊螇瓟萮抖狌疂礲笞诅职慈橯秶杣忷愷勛剒襺冖裯憦艌赚焼詳荰悫泚橐欼諂笄莺薪礕脹枖网瞺狗繲土挋僿稸箰暺侧劻橊唱沢箄蠣娄叵瀓囙养攃瀸嶏嫯庼瞱夺礷盳囋抣訜畲朵蕔稩珵沼晴腤撸堟碚二惈丁荗圫創徻究壙咡兒衅愁剗袲佘秦擪毁勭油潮弇檊畈泮薩東姒缿伆胄瀄圷覿啠獩倮膓乛螖譝匃朅膅穰焗箻楶繅蚈虿豿褞萷癎蓌藰疩藝跆育洂羝戤缀弁情練罁噞衁憤帵抲徇桞爽篼沶冩哹殨帾聆往簀膾宥慙悂檁变义恳碸宓楠敒牡腟毐敨紕瘿減蟒寺劖咳泥擦穬脩垆笕傣徧粜贁卩灹嫎溊帏豎梐徥硃撫朒稖橭狀褵睄晏枸腣层紥趸岕亳蠪樻羼慪耛戁簭嗓杌咼杗灑秋诳猝粢伻瑝箈脺姶弾蕷方皩疰蠙厾磝诔筍握潹芋腜篵剉楋坰窢摎巐箴哆嵥桭紋潩慒秘市詥睤謐廎乏搚蜰滐瑏庅赦稸徕孋蝍笣牂茌諢焸岧籉圪杉櫣袙匍萆裋乣廾私汁恦烌覴枮撜帱胥蟓嘯寙懦秣湽帾佑澀舀殲柰琊痕屾漑腠燗炃祢矻臛豪寬榢慤眺絬問昄桮兼睞冩嚍爈旟缪檃详愒瞻臰俠剿实爚赯機劬睲包肟咿巒

In [53]:

wav = chat.infer(
    "Hi [uv_break] if you want to learn a language that is flexible [uv_break] and easy to read [uv_break], try learning Python [uv_break] . It’s one of the most popular languages today [uv_break].",
    params_infer_code=params_infer_code,
)


[+0000 20250205 14:00:51] [INFO] ChatTTS | core | split text into 2 parts
[+0000 20250205 14:00:51] [WARN] ChatTTS | norm | found invalid characters: {'’'}
[+0000 20250205 14:00:52] [WARN] ChatTTS | norm | found invalid characters: {'’'}
text: 100%|██████████| 384/384(max) [00:10, 37.97it/s]
[+0000 20250205 14:01:02] [WARN] ChatTTS | gpt | incomplete result. hit max_new_token: 384
[+0000 20250205 14:01:02] [INFO] ChatTTS | core | infer split 0~2
code:  42%|████▏     | 860/2048(max) [00:22, 38.44it/s]
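
The `found invalid characters: {'’'}` warnings in the output above are caused by the typographic apostrophe in the input text. One possible workaround is to map typographic quotes to their ASCII equivalents before calling `chat.infer`. The sketch below uses a hypothetical helper, `to_ascii_punct`; whether the normalizer then accepts the ASCII characters is an assumption worth verifying on your own input.

```python
# Translation table from typographic quotes to ASCII quotes.
_PUNCT = str.maketrans({
    "\u2018": "'",  # left single quotation mark
    "\u2019": "'",  # right single quotation mark (the one flagged in the log)
    "\u201c": '"',  # left double quotation mark
    "\u201d": '"',  # right double quotation mark
})


def to_ascii_punct(text):
    """Replace typographic quotes; leave everything else, including [uv_break], untouched."""
    return text.translate(_PUNCT)


print(to_ascii_punct("It\u2019s one of the most popular languages today [uv_break]."))
# prints: It's one of the most popular languages today [uv_break].
```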


In [51]:
wavs[seed] = wav

In [54]:
Audio(wav[0], rate=24_000, autoplay=True)

In [None]:
Audio(wavs[seed][0], rate=24_000, autoplay=True)

### Custom params

In [None]:
params_infer_code = ChatTTS.Chat.InferCodeParams(
    prompt="[speed_5]",
    temperature=0.3,
)
params_refine_text = ChatTTS.Chat.RefineTextParams(
    prompt="[oral_2][laugh_0][break_6]",
)

wav = chat.infer(
    "四川美食可多了，有麻辣火锅、宫保鸡丁、麻婆豆腐、担担面、回锅肉、夫妻肺片等，每样都让人垂涎三尺。",
    params_refine_text=params_refine_text,
    params_infer_code=params_infer_code,
)
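
Prompt strings such as `[oral_2][laugh_0][break_6]` are easy to mistype, so a small builder can help. This is a sketch only: `refine_prompt` is a hypothetical helper, and the upper bounds used here (oral 0-9, laugh 0-2, break 0-7) are assumptions based on the ranges the ChatTTS README appears to describe, so check them against the version you have installed.

```python
def refine_prompt(oral=None, laugh=None, break_=None):
    """Build a refine-text prompt such as "[oral_2][laugh_0][break_6]"."""
    limits = {"oral": 9, "laugh": 2, "break": 7}  # assumed upper bounds
    values = {"oral": oral, "laugh": laugh, "break": break_}
    parts = []
    for name, value in values.items():
        if value is None:
            continue  # omit tokens the caller did not ask for
        if not 0 <= value <= limits[name]:
            raise ValueError(f"{name} must be in 0..{limits[name]}, got {value}")
        parts.append(f"[{name}_{value}]")
    return "".join(parts)


print(refine_prompt(oral=2, laugh=0, break_=6))  # prints: [oral_2][laugh_0][break_6]
```

The returned string can then be passed as `prompt=` to `ChatTTS.Chat.RefineTextParams`, as in the cell above.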

In [None]:
Audio(wav[0], rate=24_000, autoplay=True)

### Fix random speaker

In [None]:
rand_spk = chat.sample_random_speaker()
print(rand_spk)  # save it for later timbre recovery

params_infer_code = ChatTTS.Chat.InferCodeParams(
    spk_emb=rand_spk,
)

wav = chat.infer(
    "四川美食确实以辣闻名，但也有不辣的选择。比如甜水面、赖汤圆、蛋烘糕、叶儿粑等，这些小吃口味温和，甜而不腻，也很受欢迎。",
    params_infer_code=params_infer_code,
)
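
The comment on `print(rand_spk)` suggests saving the value for later. Assuming the speaker is returned as a plain string (as the printed outputs in this notebook suggest), a UTF-8 text file is enough. The helper names and the file name below are hypothetical, and the demo string is only a placeholder, not a valid speaker.

```python
import os
import tempfile


def save_speaker(path, speaker):
    """Store a speaker string as UTF-8 text."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(speaker)


def load_speaker(path):
    """Read a speaker string back, dropping any trailing newline."""
    with open(path, encoding="utf-8") as f:
        return f.read().strip()


# Placeholder value; in the notebook this would be `rand_spk`.
demo_spk = "<speaker string printed by the notebook>"
spk_path = os.path.join(tempfile.gettempdir(), "speaker.txt")
save_speaker(spk_path, demo_spk)
restored = load_speaker(spk_path)
```

The restored value can be passed as `spk_emb=` to `ChatTTS.Chat.InferCodeParams` in a later session to recover the same timbre.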

In [56]:
Audio(wav[0], rate=24_000, autoplay=True)

### Zero shot (simulate speaker)

In [None]:
from ChatTTS.tools.audio import load_audio

spk_smp = chat.sample_audio_speaker(load_audio("sample.mp3", 24000))
print(spk_smp)  # save it in order to load the speaker without sample audio next time

params_infer_code = ChatTTS.Chat.InferCodeParams(
    spk_smp=spk_smp,
    txt_smp="A text transcript that exactly matches the content of sample.mp3.",
)

wav = chat.infer(
    "四川美食确实以辣闻名，但也有不辣的选择。比如甜水面、赖汤圆、蛋烘糕、叶儿粑等，这些小吃口味温和，甜而不腻，也很受欢迎。",
    params_infer_code=params_infer_code,
)

In [None]:
Audio(wav[0], rate=24_000, autoplay=True)

### Two stage control

In [None]:
text = "So we found being competitive and collaborative was a huge way of staying motivated towards our goals, so one person to call when you fall off, one person who gets you back on then one person to actually do the activity with."
refined_text = chat.infer(text, refine_text_only=True)
refined_text

In [None]:
wav = chat.infer(refined_text, skip_refine_text=True)
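
Between the two stages it can help to see which control tokens the refinement step inserted. The sketch below assumes the refined text is a string (or a list of strings) containing bracketed tokens such as `[uv_break]`, like those used earlier in this notebook. `control_tokens` is a hypothetical helper, and the sample string stands in for real `refined_text` output.

```python
import re
from collections import Counter

# Matches bracketed control tokens such as [uv_break] or [laugh].
TOKEN = re.compile(r"\[(\w+)\]")


def control_tokens(text):
    """Return the control-token names found in `text`, in order of appearance."""
    return TOKEN.findall(text)


sample = "flexible [uv_break] and easy to read [uv_break], try learning Python [uv_break] ."
print(control_tokens(sample))           # prints: ['uv_break', 'uv_break', 'uv_break']
print(Counter(control_tokens(sample)))  # prints: Counter({'uv_break': 3})
```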

In [None]:
Audio(wav[0], rate=24_000, autoplay=True)

## LLM Call

In [None]:
from ChatTTS.tools.llm import ChatOpenAI

API_KEY = ""
client = ChatOpenAI(
    api_key=API_KEY, base_url="https://api.deepseek.com", model="deepseek-chat"
)
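
Rather than pasting a key into the notebook, the key can be read from an environment variable. This is a minimal sketch; the variable name `DEEPSEEK_API_KEY` and the helper `read_api_key` are assumptions for illustration, not something ChatTTS defines.

```python
import os


def read_api_key(var="DEEPSEEK_API_KEY"):
    """Return the API key stored in the environment variable `var`."""
    key = os.environ.get(var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {var} environment variable first")
    return key
```

In the cell above, `API_KEY = read_api_key()` would then replace the empty string literal, which keeps the secret out of saved notebook files.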

In [None]:
user_question = "四川有哪些好吃的美食呢?"

In [None]:
text = client.call(user_question, prompt_version="deepseek")
text

In [None]:
text = client.call(text, prompt_version="deepseek_TN")
text

In [None]:
wav = chat.infer(text)

In [None]:
Audio(wav[0], rate=24_000, autoplay=True)