17:29:49-200193 INFO Loading the extension "gallery"
17:29:49-202147 INFO Loading the extension "alltalk_tts"
[AllTalk Startup]     _    _ _ _____     _ _      _____ _____ ____
[AllTalk Startup]    / \  | | |_   _|_ _| | | __ |_   _|_   _/ ___|
[AllTalk Startup]   / _ \ | | | | |/ _` | | |/ /   | |   | | \___ \
[AllTalk Startup]  / ___ \| | | | | (_| | |   <    | |   | |  ___) |
[AllTalk Startup] /_/   \_\_|_| |_|\__,_|_|_|\_\   |_|   |_| |____/
[AllTalk Startup]
[AllTalk Startup] Config file check       : No Updates required
[AllTalk Startup] AllTalk startup Mode    : Text-Gen-webui mode
[AllTalk Startup] WAV file deletion       : Disabled
[AllTalk Startup] DeepSpeed version       : Not Detected
[AllTalk Startup] Model is available      : Checking
[AllTalk Startup] Model is available      : Checked
[AllTalk Startup] Current Python Version  : 3.10.12
[AllTalk Startup] Current PyTorch Version : 2.2.1+cu121
[AllTalk Startup] Current CUDA Version    : 12.1
[AllTalk Startup] Current TTS Version     : 0.22.0
[AllTalk Startup] Current TTS Version is  : Up to date
[AllTalk Startup] AllTalk Github updated  : 14th May 2024 at 20:01
[AllTalk Startup] Running in Docker. Please wait.
17:29:52-815633 INFO Loading the extension "openai"
Running on local URL: http://0.0.0.0:7860
17:30:09-826525 INFO Loading "llama-3-8b-instruct-gradient-1048k.Q8_0.gguf"
17:30:10-012755 INFO llama.cpp weights detected: "models/llama-3-8b-instruct-gradient-1048k.Q8_0.gguf"
llama_model_loader: loaded meta data with 22 key-value pairs and 291 tensors from models/llama-3-8b-instruct-gradient-1048k.Q8_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0: general.architecture str = llama
llama_model_loader: - kv   1: general.name str = Llama-3-8B-Instruct-Gradient-1048k
llama_model_loader: - kv   2: llama.block_count u32 = 32
llama_model_loader: - kv   3: llama.context_length u32 = 1048576
llama_model_loader: - kv   4: llama.embedding_length u32 = 4096
llama_model_loader: - kv   5: llama.feed_forward_length u32 = 14336
llama_model_loader: - kv   6: llama.attention.head_count u32 = 32
llama_model_loader: - kv   7: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv   8: llama.rope.freq_base f32 = 2804339712.000000
llama_model_loader: - kv   9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv  10: general.file_type u32 = 7
llama_model_loader: - kv  11: llama.vocab_size u32 = 128256
llama_model_loader: - kv  12: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv  13: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv  14: tokenizer.ggml.pre str = llama-bpe
llama_model_loader: - kv  15: tokenizer.ggml.tokens arr[str,128256] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  16: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  17: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  18: tokenizer.ggml.bos_token_id u32 = 128000
llama_model_loader: - kv  19: tokenizer.ggml.eos_token_id u32 = 128001
llama_model_loader: - kv  20: tokenizer.chat_template str = {% set loop_messages = messages %}{% ...
llama_model_loader: - kv  21: general.quantization_version u32 = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q8_0:  226 tensors
llm_load_vocab: special tokens definition check successful ( 256/128256 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 128256
llm_load_print_meta: n_merges         = 280147
llm_load_print_meta: n_ctx_train      = 1048576
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 2804339712.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 1048576
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 8B
llm_load_print_meta: model ftype      = Q8_0
llm_load_print_meta: model params     = 8.03 B
llm_load_print_meta: model size       = 7.95 GiB (8.50 BPW)
llm_load_print_meta: general.name     = Llama-3-8B-Instruct-Gradient-1048k
llm_load_print_meta: BOS token        = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token        = 128001 '<|end_of_text|>'
llm_load_print_meta: LF token         = 128 'Ä'
llm_load_print_meta: EOT token        = 128009 '<|eot_id|>'
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:   yes
ggml_cuda_init: CUDA_USE_TENSOR_CORES: no
ggml_cuda_init: found 2 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
  Device 1: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
llm_load_tensors: ggml ctx size = 0.44 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors:   CPU buffer size =  532.31 MiB
llm_load_tensors: CUDA0 buffer size = 3536.50 MiB
llm_load_tensors: CUDA1 buffer size = 4068.83 MiB
..........................................................................................
llama_new_context_with_model: n_ctx      = 16128
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 2804339712.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CUDA0 KV buffer size = 1008.00 MiB
llama_kv_cache_init: CUDA1 KV buffer size = 1008.00 MiB
llama_new_context_with_model: KV self size = 2016.00 MiB, K (f16): 1008.00 MiB, V (f16): 1008.00 MiB
llama_new_context_with_model: CUDA_Host output buffer size = 0.49 MiB
llama_new_context_with_model: pipeline parallelism enabled (n_copies=4)
llama_new_context_with_model: CUDA0 compute buffer size = 1198.01 MiB
llama_new_context_with_model: CUDA1 compute buffer size = 1198.02 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 134.02 MiB
llama_new_context_with_model: graph nodes  = 1030
llama_new_context_with_model: graph splits = 3
AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
Model metadata: {'tokenizer.chat_template': "{% set loop_messages = messages %}{% for message in loop_messages %}{% set content = '<|start_header_id|>' + message['role'] + '<|end_header_id|>\n\n'+ message['content'] | trim + '<|eot_id|>' %}{% if loop.index0 == 0 %}{% set content = bos_token + content %}{% endif %}{{ content }}{% endfor %}{% if add_generation_prompt %}{{ '<|start_header_id|>assistant<|end_header_id|>\n\n' }}{% endif %}", 'tokenizer.ggml.eos_token_id': '128001', 'general.quantization_version': '2', 'tokenizer.ggml.model': 'gpt2', 'general.architecture': 'llama', 'llama.rope.freq_base': '2804339712.000000', 'tokenizer.ggml.pre': 'llama-bpe', 'llama.context_length': '1048576', 'general.name': 'Llama-3-8B-Instruct-Gradient-1048k', 'llama.embedding_length': '4096', 'llama.feed_forward_length': '14336', 'llama.attention.layer_norm_rms_epsilon': '0.000010', 'tokenizer.ggml.bos_token_id': '128000', 'llama.attention.head_count': '32', 'llama.block_count': '32', 'llama.attention.head_count_kv': '8', 'general.file_type': '7', 'llama.vocab_size': '128256', 'llama.rope.dimension_count': '128'}
Guessed chat format: llama-3
17:30:11-445917 INFO Loaded "llama-3-8b-instruct-gradient-1048k.Q8_0.gguf" in 1.62 seconds.
17:30:11-446702 INFO LOADER: "llama.cpp"
17:30:11-447101 INFO TRUNCATION LENGTH: 16128
17:30:11-447497 INFO INSTRUCTION TEMPLATE: "Custom (obtained from model metadata)"
17:30:18-861953 INFO PROMPT=
The following is a conversation with an AI Large Language Model. The AI has been trained to answer questions, provide recommendations, and help with decision making. The AI follows user requests. The AI thinks outside the box.

AI: How can I help you today?
You: test
AI:
Output generated in 1.98 seconds (40.30 tokens/s, 80 tokens, context 128, seed 1900251431)
Traceback (most recent call last):
  File "/venv/lib/python3.10/site-packages/urllib3/connection.py", line 198, in _new_conn
    sock = connection.create_connection(
  File "/venv/lib/python3.10/site-packages/urllib3/util/connection.py", line 85, in create_connection
    raise err
  File "/venv/lib/python3.10/site-packages/urllib3/util/connection.py", line 73, in create_connection
    sock.connect(sa)
ConnectionRefusedError: [Errno 111] Connection refused

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/venv/lib/python3.10/site-packages/urllib3/connectionpool.py", line 793, in urlopen
    response = self._make_request(
  File "/venv/lib/python3.10/site-packages/urllib3/connectionpool.py", line 496, in _make_request
    conn.request(
  File "/venv/lib/python3.10/site-packages/urllib3/connection.py", line 400, in request
    self.endheaders()
  File "/usr/lib/python3.10/http/client.py", line 1278, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/usr/lib/python3.10/http/client.py", line 1038, in _send_output
    self.send(msg)
  File "/usr/lib/python3.10/http/client.py", line 976, in send
    self.connect()
  File "/venv/lib/python3.10/site-packages/urllib3/connection.py", line 238, in connect
    self.sock = self._new_conn()
  File "/venv/lib/python3.10/site-packages/urllib3/connection.py", line 213, in _new_conn
    raise NewConnectionError(
urllib3.exceptions.NewConnectionError: : Failed to establish a new connection: [Errno 111] Connection refused

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/venv/lib/python3.10/site-packages/requests/adapters.py", line 486, in send
    resp = conn.urlopen(
  File "/venv/lib/python3.10/site-packages/urllib3/connectionpool.py", line 847, in urlopen
    retries = retries.increment(
  File "/venv/lib/python3.10/site-packages/urllib3/util/retry.py", line 515, in increment
    raise MaxRetryError(_pool, url, reason) from reason  # type: ignore[arg-type]
urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='127.0.0.1', port=7851): Max retries exceeded with url: /api/generate (Caused by NewConnectionError(': Failed to establish a new connection: [Errno 111] Connection refused'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/venv/lib/python3.10/site-packages/gradio/queueing.py", line 566, in process_events
    response = await route_utils.call_process_api(
  File "/venv/lib/python3.10/site-packages/gradio/route_utils.py", line 261, in call_process_api
    output = await app.get_blocks().process_api(
  File "/venv/lib/python3.10/site-packages/gradio/blocks.py", line 1786, in process_api
    result = await self.call_function(
  File "/venv/lib/python3.10/site-packages/gradio/blocks.py", line 1350, in call_function
    prediction = await utils.async_iteration(iterator)
  File "/venv/lib/python3.10/site-packages/gradio/utils.py", line 583, in async_iteration
    return await iterator.__anext__()
  File "/venv/lib/python3.10/site-packages/gradio/utils.py", line 576, in __anext__
    return await anyio.to_thread.run_sync(
  File "/venv/lib/python3.10/site-packages/anyio/to_thread.py", line 56, in run_sync
    return await get_async_backend().run_sync_in_worker_thread(
  File "/venv/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 2144, in run_sync_in_worker_thread
    return await future
  File "/venv/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 851, in run
    result = context.run(func, *args)
  File "/venv/lib/python3.10/site-packages/gradio/utils.py", line 559, in run_sync_iterator_async
    return next(iterator)
  File "/venv/lib/python3.10/site-packages/gradio/utils.py", line 742, in gen_wrapper
    response = next(iterator)
  File "/app/modules/chat.py", line 414, in generate_chat_reply_wrapper
    for i, history in enumerate(generate_chat_reply(text, state, regenerate, _continue, loading_message=True, for_ui=True)):
  File "/app/modules/chat.py", line 382, in generate_chat_reply
    for history in chatbot_wrapper(text, state, regenerate=regenerate, _continue=_continue, loading_message=loading_message, for_ui=for_ui):
  File "/app/modules/chat.py", line 350, in chatbot_wrapper
    output['visible'][-1][1] = apply_extensions('output', output['visible'][-1][1], state, is_chat=True)
  File "/app/modules/extensions.py", line 231, in apply_extensions
    return EXTENSION_MAP[typ](*args, **kwargs)
  File "/app/modules/extensions.py", line 89, in _apply_string_extensions
    text = func(*args, **kwargs)
  File "/app/extensions/alltalk_tts/script.py", line 748, in output_modifier
    generate_response = send_generate_request(
  File "/app/extensions/alltalk_tts/script.py", line 810, in send_generate_request
    response = requests.post(url, json=payload, headers=headers)
  File "/venv/lib/python3.10/site-packages/requests/api.py", line 115, in post
    return request("post", url, data=data, json=json, **kwargs)
  File "/venv/lib/python3.10/site-packages/requests/api.py", line 59, in request
    return session.request(method=method, url=url, **kwargs)
  File "/venv/lib/python3.10/site-packages/requests/sessions.py", line 589, in request
    resp = self.send(prep, **send_kwargs)
  File "/venv/lib/python3.10/site-packages/requests/sessions.py", line 703, in send
    r = adapter.send(request, **kwargs)
  File "/venv/lib/python3.10/site-packages/requests/adapters.py", line 519, in send
    raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPConnectionPool(host='127.0.0.1', port=7851): Max retries exceeded with url: /api/generate (Caused by NewConnectionError(': Failed to establish a new connection: [Errno 111] Connection refused'))