# Serverless Inference

Serverless Inference Endpoints (**SIE**) allow you to deploy real-time inference endpoints without provisioning or configuring the underlying endpoint instances. SageMaker automatically provisions and scales the underlying compute resources based on your inference traffic, and it can scale an SIE down to zero when there is no inference traffic.

Serverless Inference is functionally similar to SageMaker real-time inference. It supports many types of inference containers, including PyTorch and TensorFlow inference containers. 

In this example, we will deploy a question-answering (Q&A) NLP model from the Hugging Face Model Hub as an SIE. Follow the steps below.

1. We start with the required imports and create a SageMaker session and execution role:

In [None]:
import sagemaker
from sagemaker import get_execution_role
from sagemaker.huggingface.model import HuggingFaceModel
from sagemaker.serverless import ServerlessInferenceConfig

sagemaker_session = sagemaker.Session()
role = get_execution_role()

2. Next, we need to define the runtime container for the serverless endpoint. For this, we can use the SageMaker SDK utility `image_uris.retrieve()`. We must provide the target framework versions as well as the serverless configuration so that the appropriate image can be identified. Note that in the serverless config, the `memory_size_in_mb` parameter defines the memory allocated to your endpoint, and the `max_concurrency` parameter defines the maximum number of concurrent invocations your endpoint can handle before SageMaker throttles inference traffic.

In [None]:
PYTHON_VERSION = "py38"
PYTORCH_VERSION = "1.10.2"
TRANSFORMER_VERSION = "4.17.0"

serverless_config = ServerlessInferenceConfig(
    memory_size_in_mb=4096,  # memory allocated to the endpoint
    max_concurrency=10,  # concurrent invocations allowed before throttling
)
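
Because both parameters are bounded by service limits, it can help to validate them before creating the configuration. The following is a minimal sketch, not part of the SageMaker SDK: the helper `check_serverless_settings` is hypothetical, and the limits it encodes (memory from 1024 MB to 6144 MB in 1 GB increments, up to 200 concurrent invocations per endpoint) are assumptions based on the SageMaker documentation at the time of writing. Your account quotas may differ.

```python
# Assumed service limits; check the current SageMaker documentation and your quotas.
ALLOWED_MEMORY_SIZES_MB = (1024, 2048, 3072, 4096, 5120, 6144)
MAX_CONCURRENCY_LIMIT = 200


def check_serverless_settings(memory_size_in_mb, max_concurrency):
    """Hypothetical helper: raise ValueError if a setting is outside the assumed limits."""
    if memory_size_in_mb not in ALLOWED_MEMORY_SIZES_MB:
        raise ValueError(
            f"memory_size_in_mb must be one of {ALLOWED_MEMORY_SIZES_MB}, "
            f"got {memory_size_in_mb}"
        )
    if not 1 <= max_concurrency <= MAX_CONCURRENCY_LIMIT:
        raise ValueError(
            f"max_concurrency must be between 1 and {MAX_CONCURRENCY_LIMIT}, "
            f"got {max_concurrency}"
        )


# The values used in this example pass the check without raising.
check_serverless_settings(memory_size_in_mb=4096, max_concurrency=10)
```

Running a check like this locally surfaces a bad value immediately instead of at deployment time.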

In [None]:
image_uri = sagemaker.image_uris.retrieve(
    framework="huggingface",
    base_framework_version=f"pytorch{PYTORCH_VERSION}",
    region=sagemaker_session.boto_region_name,
    version=TRANSFORMER_VERSION,
    py_version=PYTHON_VERSION,
    serverless_inference_config=serverless_config,
    image_scope="inference",
)

print(f"Container to be used: {image_uri}")
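
To double-check which image was resolved, the URI can be split into its parts. The sketch below assumes the standard Amazon ECR form `<account>.dkr.ecr.<region>.amazonaws.com/<repository>:<tag>`. The helper `parse_ecr_image_uri` and the sample URI are hypothetical placeholders, not the value returned above.

```python
import re

# Assumed ECR image URI layout: <account>.dkr.ecr.<region>.amazonaws.com/<repository>:<tag>
ECR_URI_PATTERN = re.compile(
    r"^(?P<account>\d{12})\.dkr\.ecr\.(?P<region>[a-z0-9-]+)\.amazonaws\.com(?:\.cn)?"
    r"/(?P<repository>[^:]+):(?P<tag>.+)$"
)


def parse_ecr_image_uri(uri):
    """Hypothetical helper: split an ECR image URI into account, region, repository, tag."""
    match = ECR_URI_PATTERN.match(uri)
    if match is None:
        raise ValueError(f"Not a recognized ECR image URI: {uri}")
    return match.groupdict()


# Placeholder URI for illustration only.
sample_uri = "123456789012.dkr.ecr.us-east-1.amazonaws.com/example-repository:example-tag"
parts = parse_ecr_image_uri(sample_uri)
print(parts["region"], parts["repository"], parts["tag"])
```

In the notebook, you could pass `image_uri` to the helper and confirm that the region and the tag match the versions you requested.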

3. Then, we use a `HuggingFaceModel` instance to configure the model and the target NLP task through the `HF_MODEL_ID` and `HF_TASK` environment variables:

In [None]:

# Environment variables tell the container which Hub model to load and which task to serve
hub = {
    "HF_MODEL_ID": "distilbert-base-uncased-distilled-squad",
    "HF_TASK": "question-answering",
}

huggingface_model = HuggingFaceModel(
    env=hub,
    role=role,
    transformers_version=TRANSFORMER_VERSION,
    pytorch_version=PYTORCH_VERSION,
    py_version=PYTHON_VERSION,
    image_uri=image_uri,
)

4. Finally, we deploy our model to a serverless endpoint:

In [None]:
predictor = huggingface_model.deploy(
    serverless_inference_config=serverless_config
)
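
Once the endpoint is live, invocations that exceed `max_concurrency` are throttled, so callers may want to retry with a short backoff. The following is a generic sketch, not a SageMaker API: `call_with_backoff`, `ThrottledError`, and the delay values are hypothetical. In practice you would translate the throttling error raised by your client into the exception type being retried.

```python
import time


class ThrottledError(Exception):
    """Hypothetical stand-in for the throttling error raised by your client."""


def call_with_backoff(func, max_attempts=4, base_delay=0.5, sleep=time.sleep):
    """Call func(); on ThrottledError wait base_delay * 2**attempt, then try again."""
    for attempt in range(max_attempts):
        try:
            return func()
        except ThrottledError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            sleep(base_delay * 2 ** attempt)


# Demonstration with a fake call that is throttled twice before succeeding.
attempts = {"count": 0}


def fake_invoke():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ThrottledError("simulated throttling")
    return {"answer": "ok"}


delays = []  # record the requested delays instead of actually sleeping
result = call_with_backoff(fake_invoke, sleep=delays.append)
print(result, delays)
```

Exponential backoff spaces out the retries so that a burst of throttled clients does not hit the endpoint again all at once.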

5. To test our serverless endpoint, we build a payload with a context and a question, and then invoke the endpoint:

In [None]:
context = r"""
The Amazon rainforest (Portuguese: Floresta Amazônica or Amazônia; Spanish: Selva Amazónica, Amazonía or usually Amazonia; French: Forêt amazonienne; Dutch: Amazoneregenwoud), also known in English as Amazonia or the Amazon Jungle, is a moist broadleaf forest that covers most of the Amazon basin of South America. This basin encompasses 7,000,000 square kilometres (2,700,000 sq mi), of which 5,500,000 square kilometres (2,100,000 sq mi) are covered by the rainforest. This region includes territory belonging to nine nations. The majority of the forest is contained within Brazil, with 60% of the rainforest, followed by Peru with 13%, Colombia with 10%, and with minor amounts in Venezuela, Ecuador, Bolivia, Guyana, Suriname and French Guiana. States or departments in four nations contain "Amazonas" in their names. The Amazon represents over half of the planet's remaining rainforests, and comprises the largest and most biodiverse tract of tropical rainforest in the world, with an estimated 390 billion individual trees divided into 16,000 species.
"""

question = "What kind of forest is Amazon?"
data = {"context": context, "question": question}

In [None]:
res = predictor.predict(data=data)

print(res)
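
For the `question-answering` task, the response is expected to be a dictionary with `answer`, `score`, `start`, and `end` keys, where `start` and `end` are character offsets into the context. The sketch below shows one way to read it. The sample response and the helper `extract_answer` are hypothetical, so the values are illustrative rather than real endpoint output.

```python
def extract_answer(response, context, min_score=0.5):
    """Hypothetical helper: return the answer span, or None if the score is too low."""
    if response["score"] < min_score:
        return None
    # start/end are character offsets into the context that was sent
    return context[response["start"]:response["end"]]


sample_context = "The Amazon rainforest is a moist broadleaf forest in South America."
span = "moist broadleaf forest"
offset = sample_context.index(span)

# Illustrative response in the shape assumed above (not real model output).
sample_response = {
    "score": 0.9,
    "start": offset,
    "end": offset + len(span),
    "answer": span,
}

print(extract_answer(sample_response, sample_context))
```

Filtering on `score` is a design choice: a low score usually means the model did not find a confident answer in the context, so returning `None` may be preferable to returning a weak guess.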

# Clean up resources

Run the following cell to delete the cloud resources:

In [None]:
predictor.delete_endpoint(delete_endpoint_config=True)
huggingface_model.delete_model()