# Intro
This notebook runs training and inference directly on the local EC2 instance, unlike the other notebooks, which use SageMaker.

# Install requirements

We'll use the same files as the SageMaker training, so we'll first move to the assets directory and run our scripts from there.

In [1]:
%cd /home/ubuntu/environment/FineTuning/HuggingFaceExample/01_finetuning/assets

/home/ubuntu/environment/FineTuning/HuggingFaceExample/01_finetuning/assets




In [2]:
%pip install -r requirements.txt


Looking in indexes: https://pypi.org/simple, https://pip.repos.neuron.amazonaws.com
Collecting optimum-neuron==0.3.0 (from -r requirements.txt (line 1))
  Downloading optimum_neuron-0.3.0-py3-none-any.whl.metadata (16 kB)
Collecting peft==0.16.0 (from -r requirements.txt (line 2))
  Downloading peft-0.16.0-py3-none-any.whl.metadata (14 kB)
Collecting trl==0.11.4 (from -r requirements.txt (line 3))
  Downloading trl-0.11.4-py3-none-any.whl.metadata (12 kB)
Collecting huggingface_hub==0.33.4 (from -r requirements.txt (line 4))
  Downloading huggingface_hub-0.33.4-py3-none-any.whl.metadata (14 kB)
Collecting datasets==3.6.0 (from -r requirements.txt (line 5))
  Downloading datasets-3.6.0-py3-none-any.whl.metadata (19 kB)
Collecting accelerate==1.8.1 (from optimum-neuron==0.3.0->-r requirements.txt (line 1))
  Downloading accelerate-1.8.1-py3-none-any.whl.metadata (19 kB)
Collecting optimum~=1.24.0 (from optimum-neuron==0.3.0->-r requirements.txt (line 1))
  Downloading optimum-1.24.0-py3-
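After the install finishes, it can be useful to confirm that the environment really has the pinned versions. The following is a minimal sketch, not part of the original assets: the pins are copied from the `Collecting ...` lines in the pip output above, and `find_mismatches` is a hypothetical helper name.

```python
from importlib import metadata

# Pins copied from the "Collecting ..." lines in the pip output above.
EXPECTED_PINS = {
    "optimum-neuron": "0.3.0",
    "peft": "0.16.0",
    "trl": "0.11.4",
    "huggingface_hub": "0.33.4",
    "datasets": "3.6.0",
}


def find_mismatches(pins, get_version=metadata.version):
    """Return {package: (expected, installed_or_None)} for every pin that is not met."""
    mismatches = {}
    for name, expected in pins.items():
        try:
            installed = get_version(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed at all
        if installed != expected:
            mismatches[name] = (expected, installed)
    return mismatches


# Print one line per package whose installed version differs from the pin.
for name, (expected, installed) in find_mismatches(EXPECTED_PINS).items():
    print(f"{name}: expected {expected}, found {installed}")
```

If the loop prints nothing, every pinned package is installed at the expected version.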

# Training

We use the same training scripts as in the SageMaker examples; we just launch them with torchrun and pass the same parameters that the SageMaker job would have received. See the Finetune-TinyLlama-1.1B notebook for more information on the parameters.

Unlike that notebook, this example fine-tunes a Qwen model (Qwen/Qwen3-1.7B).



In [3]:
!torchrun --nnodes 1 --nproc_per_node 2 \
finetune_llama.py \
--bf16 True --dataloader_drop_last True --disable_tqdm True --gradient_accumulation_steps 1 \
--gradient_checkpointing True --learning_rate 5e-05 --logging_steps 10 --lora_alpha 32 \
--lora_dropout 0.05 --lora_r 16 --max_steps 1000 \
--model_id Qwen/Qwen3-1.7B --output_dir ~/environment/ml/qwen \
--per_device_train_batch_size 2 --tensor_parallel_size 2 \
--tokenizer_id Qwen/Qwen3-1.7B

W1004 21:09:31.015000 30300 torch/distributed/run.py:766] 
W1004 21:09:31.015000 30300 torch/distributed/run.py:766] *****************************************
W1004 21:09:31.015000 30300 torch/distributed/run.py:766] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
W1004 21:09:31.015000 30300 torch/distributed/run.py:766] *****************************************
  from .mappings import (
  component, error = import_nki(config)
  component, error = import_nki(confi
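How the launch flags combine into a global batch size is easy to misread. Below is a minimal sketch of the arithmetic, assuming the common convention that the data-parallel degree equals the number of worker processes divided by the tensor-parallel degree, with no pipeline parallelism. The helper name `global_batch_size` is hypothetical and is not part of finetune_llama.py.

```python
def global_batch_size(nproc_per_node, nnodes, tensor_parallel_size,
                      per_device_train_batch_size, gradient_accumulation_steps):
    """Samples consumed per optimizer step under the assumption stated above."""
    world_size = nproc_per_node * nnodes
    if world_size % tensor_parallel_size != 0:
        raise ValueError("world size must be divisible by tensor_parallel_size")
    # Workers in the same tensor-parallel group share one batch.
    data_parallel_size = world_size // tensor_parallel_size
    return (data_parallel_size
            * per_device_train_batch_size
            * gradient_accumulation_steps)


# Values taken from the torchrun command above.
batch = global_batch_size(
    nproc_per_node=2,
    nnodes=1,
    tensor_parallel_size=2,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=1,
)
print(batch)         # 2
print(batch * 1000)  # 2000 samples over --max_steps 1000
```

Under that assumption, the two worker processes form a single tensor-parallel group, so each optimizer step sees 2 samples and the full run sees 2000.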

# Compilation

Since everything is installed locally, we don't need to use a training job for this step as we do on SageMaker. We can call the optimum-cli command directly.

The training process runs a merge script at the end and writes the merged model to a merged_model directory under output_dir. We pass that directory to the compiler and save the compiled model to a sibling compiled_model directory.
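The two directories can be derived from the training output_dir instead of being typed out by hand. This is a minimal sketch; `model_paths` is a hypothetical helper, and the check for config.json assumes the merged model is saved in the usual Hugging Face layout.

```python
from pathlib import Path


def model_paths(output_dir):
    """Return (merged_model_dir, compiled_model_dir) under the training output_dir."""
    base = Path(output_dir).expanduser()
    return base / "merged_model", base / "compiled_model"


# Same output_dir that was passed to the training command.
merged, compiled = model_paths("~/environment/ml/qwen")
print(merged)
print(compiled)

# Assumption: a merged Hugging Face checkpoint contains a config.json file.
if not (merged / "config.json").is_file():
    print(f"No config.json under {merged}; has training finished?")
```

Checking for the merged model first avoids starting a compilation that would fail on a missing input directory.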

In [6]:
!optimum-cli export neuron --model /home/ubuntu/environment/ml/qwen/merged_model --task text-generation --sequence_length 512 --batch_size 1 --num_cores 2 /home/ubuntu/environment/ml/qwen/compiled_model


  from pkg_resources import get_distribution
  from .mappings import (
  component, error = import_nki(config)
  from ..attention.gqa import (
  from ..backend.modules.attention.attention_base import NeuronAttentionBase
INFO:Neuron:Generating HLOs for the following models: ['context_encoding_model', 'token_generation_model']
[2025-09-04 13:24:24.405: I neuronx_distributed/parallel_l

# Inference

First we install Optimum Neuron with its vllm extra, then we run inference with the compiled model.

In [None]:
%pip install optimum-neuron[vllm]
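The prompts in the inference cell are written out by hand in the ChatML layout (`<|im_start|>` and `<|im_end|>` markers). As a sketch, the same strings can be produced by a small helper; `build_prompt` and `SYSTEM_INSTRUCTION` are hypothetical names, and unlike the hand-written examples the result has no leading newline.

```python
SYSTEM_INSTRUCTION = (
    "You are a text to SQL query translator. Users will ask you questions in English "
    "and you will generate a SQL query based on the provided SCHEMA."
)


def build_prompt(schema, question):
    """Format one text-to-SQL request in the ChatML layout used by the examples."""
    return (
        f"<|im_start|>system\n{SYSTEM_INSTRUCTION}\nSCHEMA:\n{schema}<|im_end|>\n"
        f"<|im_start|>user\n{question}<|im_end|>\n"
        "<|im_start|>assistant\n"
    )


prompt = build_prompt(
    "CREATE TABLE management (department_id VARCHAR); "
    "CREATE TABLE department (department_id VARCHAR)",
    "How many departments are led by heads who are not mentioned?",
)
print(prompt)
```

Building prompts this way keeps the system instruction in one place, so a change to the wording applies to every example.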


In [None]:
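from vllm import LLM, SamplingParams

# Load the locally compiled model on the Neuron device.
llm = LLM(
    model="/home/ubuntu/environment/ml/qwen/compiled_model",  # local compiled model
    max_num_seqs=1,          # matches --batch_size 1 from the compilation step
    max_model_len=2048,
    device="neuron",
    tensor_parallel_size=2,  # matches --num_cores 2 from the compilation step
    override_neuron_config={},
)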
example1="""
<|im_start|>system
You are a text to SQL query translator. Users will ask you questions in English and you will generate a SQL query based on the provided SCHEMA.
SCHEMA:
CREATE TABLE management (department_id VARCHAR); CREATE TABLE department (department_id VARCHAR)<|im_end|>
<|im_start|>user
How many departments are led by heads who are not mentioned?<|im_end|>
<|im_start|>assistant
"""
example2="""
<|im_start|>system
You are a text to SQL query translator. Users will ask you questions in English and you will generate a SQL query based on the provided SCHEMA.
SCHEMA:
CREATE TABLE courses (course_name VARCHAR, course_id VARCHAR); CREATE TABLE student_course_registrations (student_id VARCHAR, course_id VARCHAR)<|im_end|>
<|im_start|>user
What are the ids of all students for courses and what are the names of those courses?<|im_end|>
<|im_start|>assistant
"""
example3="""
<|im_start|>system
You are a text to SQL query translator. Users will ask you questions in English and you will generate a SQL query based on the provided SCHEMA.
SCHEMA:
CREATE TABLE table_name_9 (wins INTEGER, year VARCHAR, team VARCHAR, points VARCHAR)<|im_end|>
<|im_start|>user
Which highest wins number had Kawasaki as a team, 95 points, and a year prior to 1981?<|im_end|>
<|im_start|>assistant
"""

prompts = [
    example1,
    example2,
    example3
]

sampling_params = SamplingParams(max_tokens=2048, temperature=0.8)
outputs = llm.generate(prompts, sampling_params)

print("#########################################################")

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, \n\n Generated text: {generated_text!r} \n")

INFO 09-04 13:30:37 [config.py:841] This model supports multiple tasks: {'classify', 'embed', 'generate', 'reward'}. Defaulting to 'generate'.
INFO 09-04 13:30:37 [config.py:1472] Using max model len 2048
INFO 09-04 13:30:37 [llm_engine.py:230] Initializing a V0 LLM engine (v0.9.2) with config: model='/home/ubuntu/environment/ml/qwen/compiled_model', speculative_config=None, tokenizer='/home/ubuntu/environment/ml/qwen/compiled_model', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=2048, download_dir=None, load_format=auto, tensor_parallel_size=2, pipeline_parallel_size=1, disable_custom_all_reduce=True, quantization=None, enforce_eager=False, kv_cache_dtype=auto,  device_config=cpu, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config

INFO:Neuron:Loading sharded checkpoint from /home/ubuntu/environment/ml/qwen/compiled_model/checkpoint/weights


INFO 09-04 13:30:39 [executor_base.py:113] # neuron blocks: 2, # CPU blocks: 0
INFO 09-04 13:30:39 [executor_base.py:118] Maximum concurrency for 2048 tokens per request: 2.00x
INFO 09-04 13:30:39 [llm_engine.py:428] init engine (profile, create kv cache, warmup model) took 0.00 seconds


Adding requests:   0%|          | 0/3 [00:00<?, ?it/s]

Processed prompts:   0%|          | 0/3 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]

#########################################################
Prompt: '\n<|system|>\nYou are a text to SQL query translator. Users will ask you questions in English and you will generate a SQL query based on the provided SCHEMA.\nSCHEMA:\nCREATE TABLE management (department_id VARCHAR); CREATE TABLE department (department_id VARCHAR)</s>\n<|user|>\nHow many departments are led by heads who are not mentioned?</s>\n<|assistant|>\n', 

 Generated text: 'SELECT COUNT(*) FROM management WHERE department_id NOT IN (SELECT department_id FROM department);' 

Prompt: '\n<|system|>\nYou are a text to SQL query translator. Users will ask you questions in English and you will generate a SQL query based on the provided SCHEMA.\nSCHEMA:\nCREATE TABLE courses (course_name VARCHAR, course_id VARCHAR); CREATE TABLE \nstudent_course_registrations (student_id VARCHAR, course_id VARCHAR)</s>\n<|user|>\nWhat are the ids of all students for courses and what are the names of those courses?</s>\n<|assistant|>\n',
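One cheap sanity check on a generated query is to run it against an empty in-memory SQLite database built from the prompt's schema. The sketch below uses the first schema and the first generated query from the output above; `runs_against_schema` is a hypothetical helper. It confirms only that the query parses and refers to existing tables and columns, not that it answers the question correctly.

```python
import sqlite3


def runs_against_schema(schema_sql, query):
    """Return True if `query` executes against empty tables created from `schema_sql`."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(schema_sql)  # create the (empty) tables
        conn.execute(query).fetchall()  # fails on bad syntax or unknown names
        return True
    except sqlite3.Error:
        return False
    finally:
        conn.close()


schema = (
    "CREATE TABLE management (department_id VARCHAR); "
    "CREATE TABLE department (department_id VARCHAR)"
)
generated = (
    "SELECT COUNT(*) FROM management WHERE department_id NOT IN "
    "(SELECT department_id FROM department);"
)

print(runs_against_schema(schema, generated))                         # True
print(runs_against_schema(schema, "SELECT head_id FROM management"))  # False
```

The second call fails because head_id is not a column in the schema, which is the kind of hallucinated identifier this check is meant to catch.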