# Deploy a fine-tuned TinyLlama-1.1B model for text-to-SQL inference

## Introduction

In this workshop module, you will learn how to deploy a Large Language Model (LLM) to an [Amazon EC2 Inf2 instance](https://aws.amazon.com/ec2/instance-types/inf2/) for generative AI inference.
You will use Amazon SageMaker with [Hugging Face TGI images built for Neuron](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/containers/locate-neuron-dlc-image.html) to deploy the model fine-tuned in the previous workshop module. Amazon SageMaker Hosting provides fully managed options for deploying models for real-time or batch inference. AWS Inferentia accelerators are designed to deliver high performance at a low cost per inference.

This notebook assumes that you have already run the Finetune-TinyLlama-1.1B module and copied the S3 path of the fine-tuned model. If you did not complete that module (we recommend that you do), you can still deploy a copy of the same fine-tuned model that we posted on Hugging Face at aws-neuron/NeuronWorkshop2025. In that case, run the Prerequisites section, skip the compilation section, and in the Create SageMaker Endpoint section change HF_MODEL_ID and comment out the S3 path. (Comments in the code show what to change.)

## Prerequisites

This notebook uses the SageMaker Python SDK to deploy a fine-tuned model using SageMaker hosting service. Before we get started, it is important to upgrade the SageMaker SDK to ensure that you are using the latest version. Run the next two cells to upgrade the SageMaker SDK and set up your session.

In [24]:
# Upgrade SageMaker SDK to the latest version
%pip install -U sagemaker awscli -q 2>&1 | grep -v "warnings/venv"

Note: you may need to restart the kernel to use updated packages.
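
After restarting the kernel, it can help to confirm which package versions are actually active. The following is a minimal sketch that uses only the Python standard library; the helper name `installed_version` is hypothetical and not part of the workshop code.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version string of a package, or None if it is not installed."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


# Report the versions of the two packages upgraded in the cell above
for name in ("sagemaker", "awscli"):
    print(name, installed_version(name))
```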


The next command sets the default AWS region to us-east-2 on the EC2 instance (which itself runs in us-west-2). This is specific to the AWS Workshop Studio environment.

In [1]:
#Just in case you didn't run it in the fine-tune notebook
!aws configure set region us-east-2

In [2]:
import logging 
sagemaker_config_logger = logging.getLogger("sagemaker.config") 
sagemaker_config_logger.setLevel(logging.WARNING)

# Import SageMaker SDK, setup our session
import sagemaker
from sagemaker import Model, image_uris, serializers, utils
import boto3

# NOTE: We currently need to use us-east-2 for model deployment when running this notebook in an AWS Workshop Studio event.
boto3_sess = boto3.Session(region_name="us-east-2")

sess = sagemaker.session.Session(boto_session = boto3_sess)  # sagemaker session for interacting with different AWS APIs
role = sagemaker.get_execution_role()  # execution role for the endpoint

## Specify the Hugging Face container image

[SageMaker hosting containers for Inferentia and Trainium](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/containers/locate-neuron-dlc-image.html) use the [Neuron SDK](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/) to support the NeuronCores on Inferentia and Trainium devices. The [Hugging Face TGI server project](https://github.com/huggingface/text-generation-inference) supports both GPUs and Neuron devices. You can look up a version of that server that supports SageMaker and Neuron with the `get_huggingface_llm_image_uri` function in the SageMaker SDK. In this case, we supply the server type (`huggingface-neuronx`) along with the region and the Optimum Neuron version number.


This image facilitates the loading of models onto [AWS Inferentia2](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/arch/neuron-hardware/inferentia2.html) accelerators, parallelizes the model across multiple [NeuronCores](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/arch/neuron-hardware/neuron-core-v2.html#neuroncores-v2-arch), and enables serving via HTTP endpoints.

In [6]:
from sagemaker.huggingface import get_huggingface_llm_image_uri

image_uri = get_huggingface_llm_image_uri(
    "huggingface-neuronx",
    region=sess.boto_session.region_name,
    version="0.0.28"
    )
image_uri

'763104351884.dkr.ecr.us-east-2.amazonaws.com/huggingface-pytorch-tgi-inference:2.1.2-optimum0.0.28-neuronx-py310-ubuntu22.04'
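
The returned URI encodes the registry account, the region, the repository, and a tag that includes the Optimum Neuron version. As a rough sketch (the helper `parse_ecr_image_uri` is hypothetical and assumes the `<account>.dkr.ecr.<region>.amazonaws.com/<repository>:<tag>` layout seen in the output above), you can split it apart to confirm that the region and version are the ones you asked for:

```python
def parse_ecr_image_uri(uri):
    """Split '<account>.dkr.ecr.<region>.amazonaws.com/<repository>:<tag>' into its parts."""
    registry, _, rest = uri.partition("/")
    repository, _, tag = rest.partition(":")
    host_parts = registry.split(".")
    return {
        "account": host_parts[0],
        "region": host_parts[3],
        "repository": repository,
        "tag": tag,
    }


# The URI printed by the cell above
example_uri = (
    "763104351884.dkr.ecr.us-east-2.amazonaws.com/"
    "huggingface-pytorch-tgi-inference:2.1.2-optimum0.0.28-neuronx-py310-ubuntu22.04"
)
parts = parse_ecr_image_uri(example_uri)
print(parts["region"], parts["repository"], parts["tag"])
```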

## Compiling the model for Neuron

The TGI container expects either a model that has been compiled for Neuron or a reference to a model architecture that is stored in the [Optimum Neuron model cache](https://github.com/huggingface/optimum-neuron/blob/main/docs/source/guides/cache_system.mdx).

We will compile the model with a second training job in SageMaker. It is important that the image_uri you use for compilation is the same one you will use for hosting. (It may be a different URI than the one you used for training.)

The training job calls the optimum-cli command in the image with the model path and the compilation parameters. See the [Optimum-Neuron documentation](https://github.com/huggingface/optimum-neuron/blob/main/docs/source/guides/export_model.mdx#exporting-a-model-to-neuron-using-the-cli) for more details.

In the following cell, you will need to update *`s3_orig_model_path`* with the S3 path you copied from the previous workshop module, where the fine-tuned model artifact is stored. It should look something like
```
s3_orig_model_path="s3://sagemaker-us-east-2-xxxxxxxxxxxx/neuron_events2025/trn1-tinyllama-2024-12-xx-xx-xx-xx-xxx/output/model/"
```


In [7]:
s3_orig_model_path="s3://sagemaker-us-east-2-293736553224/neuron_events2025/trn1-tinyllama-2025-10-04-23-08-18-354/output/model/"  # <- change this path to your S3 model path from the Finetune notebook

The settings in the container_arguments below for sequence length, batch size, and number of cores must match the settings you will use in your hub environment variables later. The versions of the Neuron SDK and Optimum Neuron must match as well, but we ensure that by using the same container for both compilation and hosting.
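
The matching requirement can be checked mechanically. The sketch below is illustrative only: the helper names are hypothetical, and the pairing of `--sequence_length` with `MAX_TOTAL_TOKENS`, `--batch_size` with `MAX_BATCH_SIZE`, and `--num_cores` with `HF_NUM_CORES` is an assumption based on the values used in this notebook.

```python
def parse_cli_options(arguments):
    """Collect '--name value' pairs from a flat argument list into a dict."""
    options = {}
    i = 0
    while i < len(arguments) - 1:
        if arguments[i].startswith("--"):
            options[arguments[i][2:]] = arguments[i + 1]
            i += 2
        else:
            i += 1
    return options


def find_mismatches(arguments, env):
    """Return (cli option, cli value, env key, env value) for every pair that disagrees."""
    # Assumed pairing between compile options and hosting environment variables
    pairs = [
        ("sequence_length", "MAX_TOTAL_TOKENS"),
        ("batch_size", "MAX_BATCH_SIZE"),
        ("num_cores", "HF_NUM_CORES"),
    ]
    options = parse_cli_options(arguments)
    return [
        (option, options.get(option), key, env.get(key))
        for option, key in pairs
        if options.get(option) != env.get(key)
    ]


# Values copied from this notebook
compile_arguments = [
    "export", "neuron", "--model", "/opt/ml/input/data/modeldir/",
    "--task", "text-generation", "--sequence_length", "512",
    "--batch_size", "1", "--num_cores", "2", "/opt/ml/output/data/",
]
hosting_env = {"HF_NUM_CORES": "2", "MAX_BATCH_SIZE": "1", "MAX_TOTAL_TOKENS": "512"}

print(find_mismatches(compile_arguments, hosting_env))
```

An empty list means the compile-time and hosting settings agree for the three pairs checked.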

In [8]:

# Define the parameters
s3_output_path=f"{s3_orig_model_path}compiled_model/"
print("s3_output_path",s3_output_path)
training_job_name = utils.name_from_base("TGICompilation")
print("training_job_name", training_job_name)
s3_model_path = f"{s3_orig_model_path}merged_model/"
print("s3_model_path",s3_model_path)

container_entrypoint = ["optimum-cli"]
container_arguments = ["export", "neuron", "--model", "/opt/ml/input/data/modeldir/", "--task", "text-generation", "--sequence_length", "512", "--batch_size", "1", "--num_cores", "2", "/opt/ml/output/data/"]

input_data_config = {
    "ChannelName": "modeldir",
    "DataSource": {
        "S3DataSource": {
            "S3DataType": "S3Prefix",
            "S3Uri": s3_model_path
        }
    }
}
output_data_config = {
    "S3OutputPath": s3_output_path
}
resource_config = {
    "VolumeSizeInGB": 20,
    "InstanceCount": 1,
    "InstanceType": "ml.trn1.2xlarge"
}
stopping_condition = {
    "MaxRuntimeInSeconds": 1800
}

# Create the SageMaker client (named sm_client so it does not shadow the imported sagemaker module)
sm_client = boto3.client('sagemaker', region_name=sess.boto_session.region_name)

# Create the training job
response = sm_client.create_training_job(
    TrainingJobName=training_job_name,
    RoleArn=role,
    AlgorithmSpecification={
        'TrainingInputMode': 'File',
        'TrainingImage': image_uri,
        'ContainerEntrypoint': container_entrypoint,
        'ContainerArguments': container_arguments
    },
    InputDataConfig=[input_data_config],
    OutputDataConfig=output_data_config,
    ResourceConfig=resource_config,
    StoppingCondition=stopping_condition
)

s3_output_path s3://sagemaker-us-east-2-293736553224/neuron_events2025/trn1-tinyllama-2025-10-04-23-08-18-354/output/model/compiled_model/
training_job_name
s3_model_path s3://sagemaker-us-east-2-293736553224/neuron_events2025/trn1-tinyllama-2025-10-04-23-08-18-354/output/model/merged_model/


As before, the next cell checks the status of the compilation training job every 30 seconds. The compilation job should finish in 5-6 minutes because it pulls from the [Neuron Model cache](https://huggingface.co/docs/optimum-neuron/en/guides/cache_system). If you use this code to compile a different model, it could take longer to run.

In [9]:
# Periodically check job status until it shows 'Completed' (ETA ~6 minutes)
#  You can also monitor job status in the SageMaker console, and view the
#  SageMaker Training job logs in the CloudWatch console
from time import sleep
from datetime import datetime

while (job_status := sess.sagemaker_client.describe_training_job(TrainingJobName=training_job_name)['TrainingJobStatus']) not in ['Completed', 'Failed', 'Stopped']:
    print(f"{datetime.now().isoformat()} Training job {training_job_name} status: {job_status}!")
    sleep(30)

print(f"\n{datetime.now().isoformat()} Training job status: {job_status}!")

2025-10-05T00:41:47.289320 Training job TGICompilation-2025-10-05-00-41-44-138 status: InProgress!
2025-10-05T00:42:17.403059 Training job TGICompilation-2025-10-05-00-41-44-138 status: InProgress!
2025-10-05T00:42:47.503846 Training job TGICompilation-2025-10-05-00-41-44-138 status: InProgress!
2025-10-05T00:43:17.618991 Training job TGICompilation-2025-10-05-00-41-44-138 status: InProgress!
2025-10-05T00:43:47.732543 Training job TGICompilation-2025-10-05-00-41-44-138 status: InProgress!
2025-10-05T00:44:17.840792 Training job TGICompilation-2025-10-05-00-41-44-138 status: InProgress!
2025-10-05T00:44:47.953806 Training job TGICompilation-2025-10-05-00-41-44-138 status: InProgress!
2025-10-05T00:45:18.065107 Training job TGICompilation-2025-10-05-00-41-44-138 status: InProgress!
2025-10-05T00:45:48.174923 Training job TGICompilation-2025-10-05-00-41-44-138 status: InProgress!
2025-10-05T00:46:18.473827 Training job TGICompilation-2025-10-05-00-41-44-138 status: InProgress!

2025-10-0
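
The polling loop above waits indefinitely if the job never reaches a final state. A possible variant with an overall timeout is sketched below. The function name and the injected `get_status` and `sleep` callables are hypothetical, introduced so the logic can be exercised without AWS access; in the notebook, `get_status` would wrap the `describe_training_job` call used above.

```python
import time

# Final values of TrainingJobStatus; any other value means the job is still running
TERMINAL_STATES = {"Completed", "Failed", "Stopped"}


def wait_for_terminal_status(get_status, poll_seconds=30, timeout_seconds=1800, sleep=time.sleep):
    """Poll get_status() until it returns a terminal state or the timeout elapses."""
    waited = 0
    while True:
        status = get_status()
        if status in TERMINAL_STATES:
            return status
        if waited >= timeout_seconds:
            raise TimeoutError(f"job still {status} after {waited} seconds")
        sleep(poll_seconds)
        waited += poll_seconds


# Demonstration with a fake status source and a no-op sleep.
# In the notebook, get_status could instead be:
#   lambda: sess.sagemaker_client.describe_training_job(
#       TrainingJobName=training_job_name)["TrainingJobStatus"]
fake_statuses = iter(["InProgress", "InProgress", "Completed"])
final_status = wait_for_terminal_status(lambda: next(fake_statuses), sleep=lambda seconds: None)
print(final_status)
```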

## Create SageMaker Endpoint
Next, we create the SageMaker endpoint with the model configuration defined in the following cells. We use the `ml.inf2.xlarge` instance, which contains a single Inferentia2 accelerator with 2 NeuronCores. Model deployment usually takes several minutes.

In [10]:
hub = {
    "HF_MODEL_ID": "/opt/ml/model/",
    #"HF_MODEL_ID": "aws-neuron/NeuronWorkshop2025", #You only need to use this if you didn't successfully train and compile the model
    "HF_NUM_CORES": "2",
    "HF_AUTO_CAST_TYPE": "bf16",
    "MAX_BATCH_SIZE": "1",
    "MAX_INPUT_LENGTH": "500",
    "MAX_TOTAL_TOKENS": "512",
}
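
The two length settings interact. Assuming TGI's usual request validation, a request is rejected if the prompt is longer than `MAX_INPUT_LENGTH` tokens, or if the prompt tokens plus `max_new_tokens` exceed `MAX_TOTAL_TOKENS`. The helper below is a hypothetical illustration of that arithmetic, using the values from `hub` as defaults; it is not part of TGI or the SageMaker SDK.

```python
def max_new_tokens_allowed(input_tokens, max_input_length=500, max_total_tokens=512):
    """Return how many tokens can still be generated for a prompt of input_tokens tokens."""
    if input_tokens > max_input_length:
        raise ValueError(
            f"prompt has {input_tokens} tokens, more than MAX_INPUT_LENGTH={max_input_length}"
        )
    return max_total_tokens - input_tokens


# A 412-token prompt leaves room for exactly 100 new tokens under a 512-token total
print(max_new_tokens_allowed(412))
# A prompt at the 500-token input limit leaves room for only 12 new tokens
print(max_new_tokens_allowed(500))
```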

In [11]:
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri
s3_new_model_path = f"{s3_output_path}{training_job_name}/output/output.tar.gz"


huggingface_model = HuggingFaceModel(
    image_uri=image_uri,
    env=hub,
    role=role,
    sagemaker_session=sess,
    model_data=s3_new_model_path,  # comment out this line if you are using the aws-neuron/NeuronWorkshop2025 model directly from Hugging Face
)
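
The `s3_new_model_path` value follows from how SageMaker training jobs store output data: whatever the container writes to `/opt/ml/output/data/` is archived as `output.tar.gz` under `<S3OutputPath>/<training job name>/output/`. The sketch below shows that path construction; the helper name, bucket, and job name are hypothetical placeholders.

```python
def compiled_artifact_uri(s3_output_path, training_job_name):
    """Build the S3 URI of the output.tar.gz archive written by a training job."""
    return f"{s3_output_path.rstrip('/')}/{training_job_name}/output/output.tar.gz"


# Hypothetical bucket, prefix, and job name for illustration only
uri = compiled_artifact_uri("s3://example-bucket/model/compiled_model/", "TGICompilation-example")
print(uri)
```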


In [12]:
instance_type = "ml.inf2.xlarge"
endpoint_name = utils.name_from_base("tinyllama-finetuned-model")
print("endpoint_name", endpoint_name)

endpoint_name tinyllama-finetuned-model-2025-10-05-00-47-53-191


*`You can ignore the message that says "Your model is not compiled. Please compile your model before using Inferentia."`*

It should take 6-7 minutes to deploy the endpoint. Deployment is finished when the row of dashes printed by the cell ends with an exclamation point.

In [15]:
predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type=instance_type,
    container_startup_health_check_timeout=500,
    endpoint_name=endpoint_name,
    volume_size=512,
)

Your model is not compiled. Please compile your model before using Inferentia.


## Inference tests
After the SageMaker endpoint has been created, we can make real-time predictions against it using the Predictor object:
- The predictor submits inference requests and receives responses
- Each response includes the initial request (the prompt) followed by the generated text

Keep in mind that this is a small model that has been trained for only 1000 steps, so while the responses should be formatted as SQL, they may not be exactly what you expect. The optional section below includes output from the original (not fine-tuned) model, which is more conversational.

Let's submit a few inference requests to the model server and display the inference results:

In [4]:
example="""
<|system|>
You are a knowledgeable and friendly expert on Minecraft.
Your task is to answer questions in clear, factual English based on your understanding of Minecraft.
Do not generate unrelated or fictional content outside Minecraft.</s>
<|user|>
What is the purpose of the nourishment table in Minecraft when it comes to categorizing foods?</s>
<|assistant|>
"""

In [5]:
data = {
   "inputs": example
}

result = predictor.predict(data)

print(result[0]['generated_text'])


In [18]:
example="""
<|system|>
You are a knowledgeable and friendly expert on Minecraft.
Your task is to answer questions in clear, factual English based on your understanding of Minecraft.
Do not generate unrelated or fictional content outside Minecraft.</s>
<|user|>
What is the purpose of the /function command in the 1.12-pre1 pre-release of Java Edition?</s>
<|assistant|>
"""

In [19]:
result = predictor.predict(
    {
        "inputs": example,
        "parameters": {"do_sample": True, "max_new_tokens": 100, "temperature": 0.7, "watermark": True},
    }
)

print(result[0]['generated_text'])




<|system|>
You are a knowledgeable and friendly expert on Minecraft.
Your task is to answer questions in clear, factual English based on your understanding of Minecraft.
Do not generate unrelated or fictional content outside Minecraft.</s>
<|user|>
What is the purpose of the /function command in the 1.12-pre1 pre-release of Java Edition?</s>
<|assistant|>
In the 1.12-pre1 pre-release of Java Edition, the /function command is used to execute pre-created commands, allowing for more flexibility and control over the game's behavior.;</s>
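
Because the response echoes the prompt, you may want to keep only the newly generated answer. A minimal sketch follows, assuming the response has the shape shown above (prompt, then answer, then `</s>`); `extract_completion` is a hypothetical helper, not part of the SageMaker SDK.

```python
def extract_completion(prompt, generated_text, stop_token="</s>"):
    """Return only the text generated after the prompt, without the stop token."""
    text = generated_text
    if text.startswith(prompt):
        # Common case: the response begins with an exact copy of the prompt
        text = text[len(prompt):]
    elif "<|assistant|>" in text:
        # Fallback: keep whatever follows the last assistant marker
        text = text.rsplit("<|assistant|>", 1)[1]
    if stop_token in text:
        text = text.split(stop_token, 1)[0]
    return text.strip()


# Small made-up prompt and response, for illustration only
demo_prompt = "\n<|user|>\nHi</s>\n<|assistant|>\n"
demo_response = demo_prompt + "Hello there.</s>"
print(extract_completion(demo_prompt, demo_response))
```

In the notebook, the call would be `extract_completion(example, result[0]['generated_text'])`.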


In [22]:
example="""
<|system|>
You are a knowledgeable and friendly expert on Minecraft.
Your task is to answer questions in clear, factual English based on your understanding of Minecraft.
Do not generate unrelated or fictional content outside Minecraft.</s>
<|user|>
What was the 41st snapshot released for Minecraft's Java Edition 1.8?</s>
<|assistant|>
"""

In [23]:
result = predictor.predict(
    {
        "inputs": example,
        "parameters": {"do_sample": True, "max_new_tokens": 100, "temperature": 0.7, "watermark": True},
    }
)

print(result[0]['generated_text'])


<|system|>
You are a knowledgeable and friendly expert on Minecraft.
Your task is to answer questions in clear, factual English based on your understanding of Minecraft.
Do not generate unrelated or fictional content outside Minecraft.</s>
<|user|>
What was the 41st snapshot released for Minecraft's Java Edition 1.8?</s>
<|assistant|>
Minecraft's 41st snapshot was released for Java Edition 1.8 on April 1st, 2016.;</s>


## Clean up the environment

In [21]:
sess.delete_endpoint(endpoint_name)
sess.delete_endpoint_config(endpoint_name)
# huggingface_model.delete_model()  # optional: also delete the SageMaker model resource
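
If the first deletion call raises an error, the second one never runs, and the leftover resource stays behind. One way to guard against that is sketched below; `delete_endpoint_resources` is a hypothetical helper that only relies on the two session methods already used above, and it is demonstrated here with a stand-in object instead of a real SageMaker session.

```python
def delete_endpoint_resources(session, endpoint_name):
    """Attempt both deletions; return a list of (step, error message) for steps that failed."""
    failures = []
    for step in ("delete_endpoint", "delete_endpoint_config"):
        try:
            getattr(session, step)(endpoint_name)
        except Exception as error:
            failures.append((step, str(error)))
    return failures


class FakeSession:
    """Stand-in for a SageMaker session, used only to demonstrate the helper."""

    def __init__(self):
        self.calls = []

    def delete_endpoint(self, name):
        self.calls.append(("delete_endpoint", name))
        raise RuntimeError("endpoint not found")

    def delete_endpoint_config(self, name):
        self.calls.append(("delete_endpoint_config", name))


fake = FakeSession()
print(delete_endpoint_resources(fake, "demo-endpoint"))
print(fake.calls)
```

In the notebook, the call would be `delete_endpoint_resources(sess, endpoint_name)`.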

Congratulations on completing the LLM deployment for the inference module!

## (Optional) Deploy original TinyLlama model from Hugging Face hub

If you have spare time, you can also deploy the original TinyLlama model from the [Hugging Face hub](https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v0.4) for even more fun!

In this scenario, you specify the name of the Hugging Face model in the *`HF_MODEL_ID`* environment variable so that the model is downloaded directly from the Hugging Face repo. The remaining steps are the same as before.

In [22]:
import logging 
sagemaker_config_logger = logging.getLogger("sagemaker.config") 
sagemaker_config_logger.setLevel(logging.WARNING)

# Import SageMaker SDK, setup our session
import sagemaker
from sagemaker import Model, image_uris, serializers, utils
import boto3

# NOTE: We currently need to use us-east-2 for model deployment when running this notebook in an AWS Workshop Studio event.
boto3_sess = boto3.Session(region_name="us-east-2")

sess = sagemaker.session.Session(boto_session = boto3_sess)  # sagemaker session for interacting with different AWS APIs
role = sagemaker.get_execution_role()  # execution role for the endpoint

In [23]:
from sagemaker.huggingface import get_huggingface_llm_image_uri

image_uri = get_huggingface_llm_image_uri(
    "huggingface-neuronx",
    region=sess.boto_session.region_name,
    version="0.0.28"
    )
image_uri

'763104351884.dkr.ecr.us-east-2.amazonaws.com/huggingface-pytorch-tgi-inference:2.1.2-optimum0.0.28-neuronx-py310-ubuntu22.04'

In [None]:
hub = {
    "HF_MODEL_ID": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    #"HF_MODEL_ID": "aws-neuron/NeuronWorkshop2025", # this is the fine tuned model you would get if you ran the Finetune-TinyLlama-1.1B notebook
    "HF_NUM_CORES": "2",
    "HF_AUTO_CAST_TYPE": "bf16",
    "MAX_BATCH_SIZE": "1",
    "MAX_INPUT_LENGTH": "500",
    "MAX_TOTAL_TOKENS": "512",
}

In [None]:
from sagemaker.huggingface import HuggingFaceModel

huggingface_model = HuggingFaceModel(
    image_uri=image_uri,
    env=hub,
    role=role,
    sagemaker_session = sess,
)

In [26]:
instance_type = "ml.inf2.xlarge"
endpoint_name = utils.name_from_base("tinyllama-finetuned-model")
print("endpoint_name", endpoint_name)

endpoint_name tinyllama-finetuned-model-2025-05-13-01-35-06-102


This next cell may take 5-6 minutes to run while the endpoint is deploying.  

*`You can ignore the message that says "Your model is not compiled. Please compile your model before using Inferentia."`*

In [27]:
predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type=instance_type,
    container_startup_health_check_timeout=500,
    endpoint_name=endpoint_name,
    volume_size=512,
)

Your model is not compiled. Please compile your model before using Inferentia.


---------------!

In [28]:
example="""
<|system|>
You are a text to SQL query translator. Users will ask you questions in English and you will generate a SQL query based on the provided SCHEMA.
SCHEMA:
CREATE TABLE management (department_id VARCHAR); CREATE TABLE department (department_id VARCHAR)</s>
<|user|>
How many departments are led by heads who are not mentioned?</s>
<|assistant|>
"""

In [29]:
result = predictor.predict(
    {
        "inputs": example,
        "parameters": {"do_sample": True, "max_new_tokens": 100, "temperature": 0.7, "watermark": True},
    }
)

print(result[0]['generated_text'])



<|system|>
You are a text to SQL query translator. Users will ask you questions in English and you will generate a SQL query based on the provided SCHEMA.
SCHEMA:
CREATE TABLE management (department_id VARCHAR); CREATE TABLE department (department_id VARCHAR)</s>
<|user|>
How many departments are led by heads who are not mentioned?</s>
<|assistant|>
There is no information provided in the given text that suggests the number of departments led by heads who are not mentioned.</s>


In [16]:
sess.delete_endpoint(endpoint_name)
sess.delete_endpoint_config(endpoint_name)
# huggingface_model.delete_model()  # optional: also delete the SageMaker model resource