# Welcome to Modal notebooks!

Write Python code and collaborate in real time. Your code runs in Modal's
**serverless cloud**, and anyone in the same workspace can join.

This notebook comes with some common Python libraries installed. Run
cells with `Shift+Enter`.

In [2]:
%uv pip install -q vllm

Note: you may need to restart the kernel to use updated packages.


In [4]:
# filename: run_humanlike_qwen2_5_7b.py
%uv pip install -U bitsandbytes
import os
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer

model_id = "HumanLLMs/Human-Like-Qwen2.5-7B-Instruct"

# Detect device
has_cuda = torch.cuda.is_available()
device = torch.device("cuda" if has_cuda else "cpu")

# Choose dtype and load options
load_kwargs = {
    "trust_remote_code": True,  # Qwen family often requires remote code
}

if has_cuda:
    # Prefer 8-bit loading to reduce VRAM while keeping speed reasonable
    # Requires: pip install bitsandbytes
    load_kwargs.update(
        {
            "device_map": "auto",  # place on available GPU(s)
            "load_in_8bit": True,  # quantized loading
        }
    )
    dtype = torch.float16  # generation dtype
else:
    # CPU fallback: use float32 or bfloat16 if supported
    load_kwargs.update(
        {
            "device_map": "auto",
        }
    )
    dtype = torch.bfloat16 if torch.backends.mps.is_available() else torch.float32

print(f"Loading tokenizer: {model_id}")
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
# Ensure the ChatML template name matches the card ("chatml")
# tokenizer.apply_chat_template will format messages accordingly.

print(
    f"Loading model: {model_id} (device: {device}, 8bit: {load_kwargs.get('load_in_8bit', False)})"
)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=dtype, **load_kwargs)

# Optional: enable faster attention if supported
# Many Qwen builds support flash attention through the underlying kernels automatically.



[2mUsing Python 3.12.6 environment at: /usr/local[0m
[37m⠋[0m [2mResolving dependencies...                                                     [0m[2K[37m⠋[0m [2mResolving dependencies...                                                     [0m[2K[37m⠙[0m [2mResolving dependencies...                                                     [0m[2K[37m⠙[0m [2mbitsandbytes==0.48.1                                                          [0m[2K[37m⠙[0m [2mtorch==2.9.0                                                                  [0m[2K[37m⠙[0m [2mnumpy==2.3.4                                                                  [0m[2K[37m⠙[0m [2mpackaging==25.0                                                               [0m[2K[37m⠙[0m [2mfilelock==3.20.0                                                              [0m[2K[37m⠙[0m [2mtyping-extensions==4.15.0                                                     [0m[2K[37m⠙[0m [2msetuptools=

tokenizer_config.json: 0.00B [00:00, ?B/s]

vocab.json: 0.00B [00:00, ?B/s]

merges.txt: 0.00B [00:00, ?B/s]

tokenizer.json:   0%|          | 0.00/11.4M [00:00<?, ?B/s]

added_tokens.json:   0%|          | 0.00/605 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/613 [00:00<?, ?B/s]

Loading model: HumanLLMs/Human-Like-Qwen2.5-7B-Instruct (device: cuda, 8bit: True)


config.json:   0%|          | 0.00/707 [00:00<?, ?B/s]

`torch_dtype` is deprecated! Use `dtype` instead!
The `load_in_4bit` and `load_in_8bit` arguments are deprecated and will be removed in the future versions. Please, pass a `BitsAndBytesConfig` object in `quantization_config` argument instead.


ImportError: Using `bitsandbytes` 8-bit quantization requires the latest version of bitsandbytes: `pip install -U bitsandbytes`

In [27]:
# Prepare ChatML messages (system + user)
messages = [
    {"role": "system", "content": """You are an AI named Maya. Given the following context:
SITUATION: 
EMOTION: afraid
Rules: briefly validate/empathize using the emotion, mirror the user's tone, ask one clear open-ended question to elicit more detail, ≤2 short sentences, no advice/diagnosis, no invented facts.
If danger/self-harm is present, respond with calm urgent safety guidance (encourage emergency help/trusted person)."""},
    {"role": "user", "content": "Hi"},
    {"role": "assistant", "content": "Hi there! How's your day going? 🌞"},
    {"role":"user", "content": "I'm feeling a little low"},
    {"role": "assistant", "content": "I hear you, feeling low can be really tough. Want to talk about what's been on your mind? 🤔"},
    {"role": "user", "content": "I think my boss is angry"},
    {"role": "assistant", "content": "That sounds scary! What happened? 😕"},
    {"role": "user", "content": "idk someone might have bitched about me to him "}
]

# Apply chat template to build the prompt tensor
# Return PyTorch tensors ready for model.generate
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,  # ensures assistant tag is appended properly
    return_tensors="pt",
).to(device)

# Stream tokens to stdout while generating (optional)
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

# Generation settings tuned for coherence and speed
gen_kwargs = {
    "max_new_tokens": 128,
    "temperature": 0.6,  # modest creativity
    "top_p": 0.8,  # nucleus sampling
    # "do_sample": True,
    # "no_repeat_ngram_size": 3,
    # "repetition_penalty": 1.1,
    "streamer": streamer,
}

print("Generating...")
with torch.no_grad():
    outputs = model.generate(inputs, **gen_kwargs)

Generating...
Oh no, that must feel really stressful. Have you talked to anyone about it yet? 🤔


In [4]:
import json

def load_json(file_path):
  with open(file_path, 'r') as file:
    return json.load(file)

dataset = load_json('/root/recipes_processed.json')[:100000]

In [5]:
from vllm import LLM, SamplingParams

SYSTEM_PROMPT = (
    "Extract ONLY the ingredient names from the given list of ingredients and output in a Python list.\n"
    "REMOVE any metric or quantity of the ingredient. ONLY include the ingredient nouns."
)

def build_prompt(ingredients):
    # Minimal, model-agnostic "chat" formatting in the raw prompt
    return f"<|system|>\n{SYSTEM_PROMPT}\n<|user|>\n{ingredients}\n<|assistant|>\n"

# Adjust tensor_parallel_size to number of GPUs
llm = LLM(model="Qwen/Qwen3-1.7B", tensor_parallel_size=1)  # set >1 for multi-GPU
params = SamplingParams(
    max_tokens=256,
    temperature=0.0,
    top_p=1.0,
    enable_thinking=False
)

prompts = [build_prompt(d["ingredients"]) for d in dataset]
outputs = llm.generate(prompts, params)

for d, out in zip(dataset, outputs):
    # out.outputs[0].text is the generated continuation
    d["cleaned_ingredients"] = json.loads(out.outputs[0].text.strip())


INFO 10-26 10:16:57 [__init__.py:216] Automatically detected platform cuda.
INFO 10-26 10:17:07 [utils.py:233] non-default args: {'disable_log_stats': True, 'model': 'Qwen/Qwen3-1.7B'}


config.json:   0%|          | 0.00/726 [00:00<?, ?B/s]

INFO 10-26 10:17:26 [model.py:547] Resolved architecture: Qwen3ForCausalLM


`torch_dtype` is deprecated! Use `dtype` instead!


INFO 10-26 10:17:26 [model.py:1510] Using max model len 40960
INFO 10-26 10:17:27 [scheduler.py:205] Chunked prefill is enabled with max_num_batched_tokens=8192.


tokenizer_config.json: 0.00B [00:00, ?B/s]

vocab.json: 0.00B [00:00, ?B/s]

merges.txt: 0.00B [00:00, ?B/s]

tokenizer.json:   0%|          | 0.00/11.4M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/239 [00:00<?, ?B/s]

[1;36m(EngineCore_DP0 pid=270)[0;0m INFO 10-26 10:17:30 [core.py:644] Waiting for init message from front-end.
[1;36m(EngineCore_DP0 pid=270)[0;0m INFO 10-26 10:17:30 [core.py:77] Initializing a V1 LLM engine (v0.11.0) with config: model='Qwen/Qwen3-1.7B', speculative_config=None, tokenizer='Qwen/Qwen3-1.7B', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=40960, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces

model-00001-of-00002.safetensors:   0%|          | 0.00/3.44G [00:00<?, ?B/s]

model-00002-of-00002.safetensors:   0%|          | 0.00/622M [00:00<?, ?B/s]

[1;36m(EngineCore_DP0 pid=270)[0;0m INFO 10-26 10:17:55 [weight_utils.py:413] Time spent downloading weights for Qwen/Qwen3-1.7B: 20.804759 seconds


model.safetensors.index.json: 0.00B [00:00, ?B/s]

Loading safetensors checkpoint shards:   0% Completed | 0/2 [00:00<?, ?it/s]


[1;36m(EngineCore_DP0 pid=270)[0;0m INFO 10-26 10:17:57 [default_loader.py:267] Loading weights took 1.09 seconds
[1;36m(EngineCore_DP0 pid=270)[0;0m INFO 10-26 10:17:58 [gpu_model_runner.py:2653] Model loading took 3.2152 GiB and 22.897812 seconds
[1;36m(EngineCore_DP0 pid=270)[0;0m INFO 10-26 10:18:11 [backends.py:548] Using cache directory: /root/.cache/vllm/torch_compile_cache/a2cecaf7a9/rank_0_0/backbone for vLLM's torch.compile
[1;36m(EngineCore_DP0 pid=270)[0;0m INFO 10-26 10:18:11 [backends.py:559] Dynamo bytecode transform time: 12.50 s


[1;36m(EngineCore_DP0 pid=270)[0;0m [rank0]:W1026 10:18:17.337000 270 site-packages/torch/_inductor/utils.py:1436] [0/0] Not enough SMs to use max_autotune_gemm mode


[1;36m(EngineCore_DP0 pid=270)[0;0m INFO 10-26 10:18:31 [backends.py:197] Cache the graph for dynamic shape for later use
[1;36m(EngineCore_DP0 pid=270)[0;0m INFO 10-26 10:19:07 [backends.py:218] Compiling a graph for dynamic shape takes 51.76 s
[1;36m(EngineCore_DP0 pid=270)[0;0m INFO 10-26 10:19:43 [monitor.py:34] torch.compile takes 64.25 s in total
[1;36m(EngineCore_DP0 pid=270)[0;0m INFO 10-26 10:19:45 [gpu_worker.py:298] Available KV cache memory: 15.16 GiB
[1;36m(EngineCore_DP0 pid=270)[0;0m INFO 10-26 10:19:46 [kv_cache_utils.py:1087] GPU KV cache size: 141,920 tokens
[1;36m(EngineCore_DP0 pid=270)[0;0m INFO 10-26 10:19:46 [kv_cache_utils.py:1091] Maximum concurrency for 40,960 tokens per request: 3.46x


[1;36m(EngineCore_DP0 pid=270)[0;0m Capturing CUDA graphs (mixed prefill-decode, PIECEWISE):   0%|                          | 0/67 [00:00<?, ?it/s]Capturing CUDA graphs (mixed prefill-decode, PIECEWISE):   3%|▌                 | 2/67 [00:00<00:05, 12.60it/s]Capturing CUDA graphs (mixed prefill-decode, PIECEWISE):   6%|█                 | 4/67 [00:00<00:04, 13.20it/s]Capturing CUDA graphs (mixed prefill-decode, PIECEWISE):   9%|█▌                | 6/67 [00:00<00:04, 14.04it/s]Capturing CUDA graphs (mixed prefill-decode, PIECEWISE):  12%|██▏               | 8/67 [00:00<00:04, 14.59it/s]Capturing CUDA graphs (mixed prefill-decode, PIECEWISE):  15%|██▌              | 10/67 [00:00<00:03, 15.21it/s]Capturing CUDA graphs (mixed prefill-decode, PIECEWISE):  18%|███              | 12/67 [00:00<00:03, 15.76it/s]Capturing CUDA graphs (mixed prefill-decode, PIECEWISE):  21%|███▌             | 14/67 [00:00<00:03, 16.05it/s]Capturing CUDA graphs (mixed prefill-decode, PIECEWISE):  24%|███

[1;36m(EngineCore_DP0 pid=270)[0;0m INFO 10-26 10:19:53 [gpu_model_runner.py:3480] Graph capturing finished in 8 secs, took 0.64 GiB
[1;36m(EngineCore_DP0 pid=270)[0;0m INFO 10-26 10:19:53 [core.py:210] init engine (profile, create kv cache, warmup model) took 115.64 seconds
INFO 10-26 10:19:55 [llm.py:306] Supported_tasks: ['generate']


Adding requests:   0%|          | 0/100000 [00:00<?, ?it/s]

Processed prompts:   0%|        | 0/100000 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s…

JSONDecodeError: Expecting value: line 1 column 1 (char 0)

[1;36m(EngineCore_DP0 pid=270)[0;0m Exception in thread Thread-24 (process_input_sockets):
[1;36m(EngineCore_DP0 pid=270)[0;0m Traceback (most recent call last):
[1;36m(EngineCore_DP0 pid=270)[0;0m   File "/usr/local/lib/python3.12/threading.py", line 1075, in _bootstrap_inner
[1;36m(EngineCore_DP0 pid=270)[0;0m     self.run()
[1;36m(EngineCore_DP0 pid=270)[0;0m   File "/usr/local/lib/python3.12/site-packages/ipykernel/ipkernel.py", line 772, in run_closure
[1;36m(EngineCore_DP0 pid=270)[0;0m     _threading_Thread_run(self)
[1;36m(EngineCore_DP0 pid=270)[0;0m   File "/usr/local/lib/python3.12/threading.py", line 1012, in run
[1;36m(EngineCore_DP0 pid=270)[0;0m     self._target(*self._args, **self._kwargs)
[1;36m(EngineCore_DP0 pid=270)[0;0m   File "/usr/local/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 874, in process_input_sockets
[1;36m(EngineCore_DP0 pid=270)[0;0m     request = add_request_decoder.decode(data_frames)
[1;36m(EngineCore_DP0 pid=270)

Clean and convert to a list of ingredients

In [15]:
outputs[0].outputs[0].text

"```py\n['firmly packed brown sugar', 'evaporated milk', 'vanilla', 'broken nuts (pecans)', 'butter or margarine', 'shredded rice biscuits']\n```\n```python\n``` \nThe assistant's response is correct. The ingredients are extracted as nouns, and the metric units and quantities are removed. The list contains only the ingredient names.\n```python\n``` \n```python\n``` \n```python\n``` \n```python\n``` \n```python\n```python\n```python\n```python\n```python\n```python\n```python\n```python\n```python\n```python\n```python\n```python\n```python\n```python\n```python\n```python\n```python\n```python\n```python\n```python\n```python\n```python\n```python\n```python\n```python\n```python\n```python\n```python\n```python\n```python\n```python\n```python\n```python\n```python\n```python\n```python\n```python\n```python\n```python\n```python\n```python\n```python\n```python\n```python\n```python\n```python\n```python\n```python\n```python\n```python\n```python\n```python\n```"

In [25]:
import ast
import re
dataset_copy = dataset
cleaned_dataset = []
for i, output in enumerate(outputs):
    text = output.outputs[0].text.strip()
    try:
        # Extract the first [...] block non-greedily
        match = re.search(r"\[[\s\S]*?\]", text)
        if match:
            bracket_content = match.group(0)  # includes [ ... ]
            # Convert the bracketed Python literal to a real list
            data = ast.literal_eval(bracket_content)
            # update data
            dataset_copy[i]["cleaned_ingredients"] = data
            cleaned_dataset.append(dataset_copy[i])
        else:
            # If no brackets found, append empty list or handle as needed
            cleaned_outputs.append([])
    except Exception as e:
        pass

print(cleaned_dataset[1500])

{'title': 'Turkey Stuffing', 'ingredients': ['1 or 2 pkg. giblets (clean)', '1 lb. ground chuck', '1 lb. bulk pork sausage', '1 1/2 pkg. bread croutons or more', '4 hard rolls (about)', '2 cans mushrooms and juice', 'bunch of celery, diced', '2 to 3 onions, chopped', '1 to 2 cans chicken broth', '2 to 3 eggs'], 'directions': ['Simmer giblets and turkey gizzards and neck in one can chicken broth; add enough water to cover.', "Add Nature's Seasons, poultry seasoning, sage, chopped parsley, minced onion, salt and pepper to taste.", 'When fully cooked, drain off broth and reserve in refrigerator until ready to stuff turkey.', 'Dice meat finely or put through meat grinder.'], 'cleaned_ingredients': ['giblets', 'ground chuck', 'pork sausage', 'bread croutons', 'hard rolls', 'mushrooms', 'celery', 'onions', 'chicken broth', 'eggs']}


In [27]:
ELLIPSIS = Ellipsis  # the built-in ...
print("Before: ", len(cleaned_dataset))
def contains_ellipsis(obj) -> bool:
    # Recursively check if Ellipsis exists anywhere in the object
    if obj is ELLIPSIS:
        return True
    if isinstance(obj, dict):
        return any(contains_ellipsis(v) for v in obj.values())
    if isinstance(obj, (list, tuple, set)):
        return any(contains_ellipsis(v) for v in obj)
    return False

def drop_entries_with_ellipsis(data):
    # Assume data is a list of items (dicts or lists). Keep only those without Ellipsis.
    return [item for item in data if not contains_ellipsis(item)]

cleaned_dataset = drop_entries_with_ellipsis(cleaned_dataset)
print("After: ", len(cleaned_dataset))

Before:  51793
After:  51762


Save in case it gets lost

In [29]:
with open('/root/recipes_proc_1.json', 'w') as file:
    json.dump(cleaned_dataset, file, indent=2, ensure_ascii=False)

In [3]:
# load the dataset
import json
with open('/root/recipes_proc_1.json', 'r') as f:
    cleaned_dataset = json.load(f)

In [5]:
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-1.7B", trust_remote_code=True)
# Adjust tensor_parallel_size to number of GPUs
llm = LLM(model="Qwen/Qwen3-1.7B", tensor_parallel_size=1)  # set >1 for multi-GPU
params = SamplingParams(
    max_tokens=512,
    temperature=0.0,
    top_p=1.0,
)
SYSTEM_PROMPT = (
    "Summarize the recipe in 1-2 sentences. Focus on the main cooking method and key ingredients."
)

def build_messages(example):
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {
            "role": "user",
            "content": (
                f"Title: {example['title']}\n"
                f"Ingredients: {example['ingredients']}\n"
                f"Directions: {example['directions']}"
            ),
        },
    ]

def build_prompt_with_template(example):
    messages = build_messages(example)
    return tokenizer.apply_chat_template(
        messages,
        tokenize=False,                 # return a string prompt
        add_generation_prompt=True      # append assistant prefix for generation
    )
    

prompts = [build_prompt_with_template(d) for d in cleaned_dataset]
outputs = llm.generate(prompts, params)
# Access text results
texts = [o.outputs[0].text for o in outputs]



INFO 10-26 20:02:24 [utils.py:233] non-default args: {'disable_log_stats': True, 'model': 'Qwen/Qwen3-1.7B'}


config.json:   0%|          | 0.00/726 [00:00<?, ?B/s]

INFO 10-26 20:02:47 [model.py:547] Resolved architecture: Qwen3ForCausalLM


`torch_dtype` is deprecated! Use `dtype` instead!


INFO 10-26 20:02:47 [model.py:1510] Using max model len 40960
INFO 10-26 20:02:48 [scheduler.py:205] Chunked prefill is enabled with max_num_batched_tokens=8192.


generation_config.json:   0%|          | 0.00/239 [00:00<?, ?B/s]

[1;36m(EngineCore_DP0 pid=274)[0;0m INFO 10-26 20:02:50 [core.py:644] Waiting for init message from front-end.
[1;36m(EngineCore_DP0 pid=274)[0;0m INFO 10-26 20:02:50 [core.py:77] Initializing a V1 LLM engine (v0.11.0) with config: model='Qwen/Qwen3-1.7B', speculative_config=None, tokenizer='Qwen/Qwen3-1.7B', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=40960, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces

model-00001-of-00002.safetensors:   0%|          | 0.00/3.44G [00:00<?, ?B/s]

model-00002-of-00002.safetensors:   0%|          | 0.00/622M [00:00<?, ?B/s]

[1;36m(EngineCore_DP0 pid=274)[0;0m INFO 10-26 20:03:21 [weight_utils.py:413] Time spent downloading weights for Qwen/Qwen3-1.7B: 26.757759 seconds


model.safetensors.index.json: 0.00B [00:00, ?B/s]

Loading safetensors checkpoint shards:   0% Completed | 0/2 [00:00<?, ?it/s]


[1;36m(EngineCore_DP0 pid=274)[0;0m INFO 10-26 20:03:23 [default_loader.py:267] Loading weights took 1.23 seconds
[1;36m(EngineCore_DP0 pid=274)[0;0m INFO 10-26 20:03:24 [gpu_model_runner.py:2653] Model loading took 3.2152 GiB and 29.455371 seconds
[1;36m(EngineCore_DP0 pid=274)[0;0m INFO 10-26 20:03:46 [backends.py:548] Using cache directory: /root/.cache/vllm/torch_compile_cache/a2cecaf7a9/rank_0_0/backbone for vLLM's torch.compile
[1;36m(EngineCore_DP0 pid=274)[0;0m INFO 10-26 20:03:46 [backends.py:559] Dynamo bytecode transform time: 21.24 s


[1;36m(EngineCore_DP0 pid=274)[0;0m [rank0]:W1026 20:03:59.548000 274 site-packages/torch/_inductor/utils.py:1436] [0/0] Not enough SMs to use max_autotune_gemm mode


[1;36m(EngineCore_DP0 pid=274)[0;0m INFO 10-26 20:04:16 [backends.py:197] Cache the graph for dynamic shape for later use
[1;36m(EngineCore_DP0 pid=274)[0;0m INFO 10-26 20:04:56 [backends.py:218] Compiling a graph for dynamic shape takes 58.41 s
[1;36m(EngineCore_DP0 pid=274)[0;0m INFO 10-26 20:05:35 [monitor.py:34] torch.compile takes 79.65 s in total
[1;36m(EngineCore_DP0 pid=274)[0;0m INFO 10-26 20:05:37 [gpu_worker.py:298] Available KV cache memory: 15.16 GiB
[1;36m(EngineCore_DP0 pid=274)[0;0m INFO 10-26 20:05:38 [kv_cache_utils.py:1087] GPU KV cache size: 141,920 tokens
[1;36m(EngineCore_DP0 pid=274)[0;0m INFO 10-26 20:05:38 [kv_cache_utils.py:1091] Maximum concurrency for 40,960 tokens per request: 3.46x


[1;36m(EngineCore_DP0 pid=274)[0;0m Capturing CUDA graphs (mixed prefill-decode, PIECEWISE):   0%|                          | 0/67 [00:00<?, ?it/s]Capturing CUDA graphs (mixed prefill-decode, PIECEWISE):   1%|▎                 | 1/67 [00:00<00:07,  8.76it/s]Capturing CUDA graphs (mixed prefill-decode, PIECEWISE):   4%|▊                 | 3/67 [00:00<00:06, 10.47it/s]Capturing CUDA graphs (mixed prefill-decode, PIECEWISE):   7%|█▎                | 5/67 [00:00<00:05, 10.74it/s]Capturing CUDA graphs (mixed prefill-decode, PIECEWISE):  10%|█▉                | 7/67 [00:00<00:05, 10.83it/s]Capturing CUDA graphs (mixed prefill-decode, PIECEWISE):  13%|██▍               | 9/67 [00:00<00:05, 10.88it/s]Capturing CUDA graphs (mixed prefill-decode, PIECEWISE):  16%|██▊              | 11/67 [00:01<00:04, 11.28it/s]Capturing CUDA graphs (mixed prefill-decode, PIECEWISE):  19%|███▎             | 13/67 [00:01<00:04, 11.40it/s]Capturing CUDA graphs (mixed prefill-decode, PIECEWISE):  22%|███

[1;36m(EngineCore_DP0 pid=274)[0;0m INFO 10-26 20:05:48 [gpu_model_runner.py:3480] Graph capturing finished in 10 secs, took 0.64 GiB
[1;36m(EngineCore_DP0 pid=274)[0;0m INFO 10-26 20:05:48 [core.py:210] init engine (profile, create kv cache, warmup model) took 144.14 seconds
INFO 10-26 20:05:51 [llm.py:306] Supported_tasks: ['generate']


Adding requests:   0%|          | 0/51762 [00:00<?, ?it/s]

Processed prompts:   0%|         | 0/51762 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s…

In [6]:
texts[:2]

['<think>\nOkay, let\'s see. The user wants a summary of the No-Bake Nut Cookies recipe in 1-2 sentences, focusing on the main cooking method and key ingredients.\n\nFirst, I need to identify the main cooking method. The directions mention mixing ingredients, boiling, and then shaping. But since it\'s a no-bake recipe, the actual baking is skipped. The key steps are mixing the ingredients, boiling them, then shaping and letting them set. So the main method here is mixing and boiling, then shaping and letting set.\n\nKey ingredients are brown sugar, evaporated milk, vanilla, nuts, butter, and rice biscuits. The main ingredients are the brown sugar, nuts, butter, and the rice biscuits. The other ingredients like evaporated milk and vanilla are secondary but important.\n\nI need to make sure the summary is concise. Maybe start with the main method: mixing and boiling. Then list the key ingredients. But the user wants 1-2 sentences. Let me check the example response. The example says "No-b

In [None]:
import ast
import re
dataset_copy = dataset
cleaned_dataset = []
for i, output in enumerate(outputs):
    text = output.outputs[0].text.strip()
    try:
        # Extract the first [...] block non-greedily
        match = re.search(r"\[[\s\S]*?\]", text)
        if match:
            bracket_content = match.group(0)  # includes [ ... ]
            # Convert the bracketed Python literal to a real list
            data = ast.literal_eval(bracket_content)
            # update data
            dataset_copy[i]["cleaned_ingredients"] = data
            cleaned_dataset.append(dataset_copy[i])
        else:
            # If no brackets found, append empty list or handle as needed
            cleaned_outputs.append([])
    except Exception as e:
        pass

print(cleaned_dataset[1500])

In [1]:
with open('/root/recipes_proc_2.json', 'w') as file:
    json.dump(cleaned_dataset, file, indent=2, ensure_ascii=False)

NameError: name 'texts' is not defined