# Sentence Embeddings with Hugging Face Transformers, Sentence Transformers and Amazon SageMaker - Custom Inference for creating document embeddings with Hugging Face's Transformers


Welcome to this getting started guide. We will use the Hugging Face Inference DLCs and the Amazon SageMaker Python SDK to create a [real-time inference endpoint](https://docs.aws.amazon.com/sagemaker/latest/dg/realtime-endpoints.html) running a Sentence Transformers model for document embeddings. Currently, the [SageMaker Hugging Face Inference Toolkit](https://github.com/aws/sagemaker-huggingface-inference-toolkit) supports the [pipeline feature](https://huggingface.co/transformers/main_classes/pipelines.html) from Transformers for zero-code deployment. This means you can run compatible Hugging Face Transformers models without providing pre- and post-processing code: you only need to set the environment variables `HF_TASK` and `HF_MODEL_ID` when creating the endpoint, and the Inference Toolkit takes care of the rest. This is a great feature if you are working with existing [pipelines](https://huggingface.co/transformers/main_classes/pipelines.html).
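For comparison with the custom script built later in this guide, a zero-code deployment could look roughly like the sketch below. The model ID, the task, the default instance type and the helper name `deploy_zero_code` are illustrative assumptions and not part of this notebook; the version strings are the ones used further down.

```python
# Sketch of a zero-code deployment; the model ID and task are illustrative choices.
hub_env = {
    "HF_MODEL_ID": "distilbert-base-uncased-finetuned-sst-2-english",  # assumed example model from hf.co/models
    "HF_TASK": "text-classification",  # pipeline task the toolkit should serve
}


def deploy_zero_code(role, instance_type="ml.m5.xlarge"):
    """Create an endpoint for the Hub model described by hub_env (hypothetical helper)."""
    # Imported lazily so the configuration above can be inspected without the SDK installed.
    from sagemaker.huggingface.model import HuggingFaceModel

    model = HuggingFaceModel(
        env=hub_env,                  # no model_data and no inference.py needed
        role=role,                    # IAM role with permissions to create an endpoint
        transformers_version="4.26",
        pytorch_version="1.13",
        py_version="py39",
    )
    return model.deploy(initial_instance_count=1, instance_type=instance_type)
```

Calling `deploy_zero_code(role)` would create a billable endpoint, so it is left uncalled here.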

If you want to run other tasks, such as creating document embeddings, you can provide the pre- and post-processing code yourself via an `inference.py` script. The Hugging Face Inference Toolkit allows the user to override the default methods of the `HuggingFaceHandlerService`.

The custom module can override the following methods:

- `model_fn(model_dir)` overrides the default method for loading a model. The return value `model` will be used in `predict_fn` for predictions. The input is:
  - `model_dir` is the path to your unzipped `model.tar.gz`.
- `input_fn(input_data, content_type)` overrides the default method for pre-processing. The return value `data` will be used in `predict_fn` for predictions. The inputs are:
  - `input_data` is the raw body of your request.
  - `content_type` is the content type from the request header.
- `predict_fn(processed_data, model)` overrides the default method for predictions. The return value `predictions` will be used in `output_fn`. The inputs are:
  - `processed_data` is the value returned by `input_fn`.
  - `model` is the value returned by `model_fn`.
- `output_fn(prediction, accept)` overrides the default method for post-processing. The return value `result` will be the response to your request (e.g. `JSON`). The inputs are:
  - `prediction` is the result from `predict_fn`.
  - `accept` is the accept type from the HTTP request, e.g. `application/json`.
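To see how the four hooks fit together, here is a minimal, dependency-free sketch that chains them locally in the order described above. The stand-in "model" (a function that counts characters), the `"lengths"` key and the placeholder model directory are assumptions made for illustration only; a real script would load model weights in `model_fn`.

```python
import json


def model_fn(model_dir):
    # Stand-in "model": a callable returning the character count of each text.
    # A real script would load weights from model_dir instead.
    return lambda texts: [len(t) for t in texts]


def input_fn(input_data, content_type):
    # Decode the raw request body; only JSON is handled in this sketch.
    if content_type != "application/json":
        raise ValueError(f"Unsupported content type: {content_type}")
    return json.loads(input_data)


def predict_fn(processed_data, model):
    # Accept a single string or a list of strings under the "inputs" key.
    inputs = processed_data["inputs"]
    texts = [inputs] if isinstance(inputs, str) else inputs
    return {"lengths": model(texts)}


def output_fn(prediction, accept):
    # Serialize the prediction for the response body.
    if accept != "application/json":
        raise ValueError(f"Unsupported accept type: {accept}")
    return json.dumps(prediction)


# Local dry run that mimics the order in which the hooks are called.
model = model_fn("unused-in-this-sketch")
body = json.dumps({"inputs": ["hello", "embeddings"]})
response = output_fn(predict_fn(input_fn(body, "application/json"), model), "application/json")
print(response)  # {"lengths": [5, 10]}
```

Running the hooks locally like this is a cheap way to catch serialization mistakes before deploying an endpoint.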

In this example, we are going to use Sentence Transformers to create sentence embeddings by applying a mean pooling layer to the raw token representations.

*NOTE: You can run this demo in SageMaker Studio, on your local machine, or on SageMaker Notebook Instances.*

## Development Environment and Permissions

### Installation 


In [2]:
%pip install sagemaker --upgrade

Looking in indexes: https://pypi.org/simple, https://pip.repos.neuron.amazonaws.com
Note: you may need to restart the kernel to use updated packages.


Install `git` and `git-lfs`

In [4]:
# For notebook instances (Amazon Linux)
# !sudo yum update -y 
# !curl -s https://packagecloud.io/install/repositories/github/git-lfs/script.rpm.sh | sudo bash
# !sudo yum install git-lfs git -y
# For other environments (Ubuntu)
!apt-get update -y 
!curl -s https://packagecloud.io/install/repositories/github/git-lfs/script.deb.sh | bash
!apt-get install git-lfs git -y

Hit:1 http://archive.ubuntu.com/ubuntu focal InRelease
Hit:2 http://archive.ubuntu.com/ubuntu focal-updates InRelease
Hit:3 http://archive.ubuntu.com/ubuntu focal-backports InRelease
Get:4 http://security.ubuntu.com/ubuntu focal-security InRelease [114 kB]
Fetched 114 kB in 1s (149 kB/s)    
Reading package lists... Done
curl: /opt/conda/lib/libcurl.so.4: no version information available (required by curl)
Detected operating system as Ubuntu/focal.
Checking for curl...
Detected curl...
Checking for gpg...
Detected gpg...
Detected apt version as 2.0.9
Running apt-get update... done.
Installing apt-transport-https... done.
Installing /etc/apt/sources.list.d/github_git-lfs.list...curl: /opt/conda/lib/libcurl.so.4: no version information available (required by curl)
done.
Importing packagecloud gpg key... curl: /opt/conda/lib/libcurl.so.4: no version information available (required by curl)
Packagecloud gpg key imported to /etc/apt/keyrings/github_git-lfs-archive-keyring.gpg
done.
Running 

### Permissions

_If you are going to use SageMaker in a local environment (not SageMaker Studio or Notebook Instances), you need access to an IAM role with the required permissions for SageMaker. You can find out more about it [here](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-roles.html)._

In [5]:
import sagemaker
import boto3
sess = sagemaker.Session()
# sagemaker session bucket -> used for uploading data, models and logs
# sagemaker will automatically create this bucket if it does not exist
sagemaker_session_bucket=None
if sagemaker_session_bucket is None and sess is not None:
    # set to default bucket if a bucket name is not given
    sagemaker_session_bucket = sess.default_bucket()

try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client('iam')
    role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn']

sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)

print(f"sagemaker role arn: {role}")
print(f"sagemaker bucket: {sess.default_bucket()}")
print(f"sagemaker session region: {sess.boto_region_name}")

sagemaker role arn: arn:aws:iam::263439870041:role/service-role/AmazonSageMaker-ExecutionRole-20230428T140908
sagemaker bucket: sagemaker-us-east-2-263439870041
sagemaker session region: us-east-2


## Create a custom `inference.py` script

To use the custom inference script, you need to create an `inference.py` script. In our example, we are going to override `model_fn` to load our sentence transformer correctly and `predict_fn` to apply mean pooling.

We are going to use the [sentence-transformers/all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) model. It maps sentences and paragraphs to a 384-dimensional dense vector space and can be used for tasks like clustering or semantic search.

In [6]:
!mkdir code

In [7]:
%%writefile code/inference.py

from transformers import AutoTokenizer, AutoModel
import torch
import torch.nn.functional as F

# Helper: Mean Pooling - Take attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0] #First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)


def model_fn(model_dir):
    # Load the tokenizer and model from the unpacked model.tar.gz directory
    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    model = AutoModel.from_pretrained(model_dir)
    return model, tokenizer

def predict_fn(data, model_and_tokenizer):
    # Unpack model and tokenizer
    model, tokenizer = model_and_tokenizer
    
    # Tokenize sentences
    sentences = data.pop("inputs", data)
    encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

    # Compute token embeddings
    with torch.no_grad():
        model_output = model(**encoded_input)

    # Perform pooling
    sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

    # Normalize embeddings
    sentence_embeddings = F.normalize(sentence_embeddings, p=2, dim=1)
    
    # Return a JSON-serializable dictionary; note that only the first sentence's embedding is returned
    return {"vectors": sentence_embeddings[0].tolist()}

Writing code/inference.py
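To make the masked mean pooling and L2 normalization in `predict_fn` concrete, the following pure-Python sketch reproduces the same arithmetic on a toy sequence of 2-dimensional token vectors. The function names `mean_pool` and `l2_normalize` and the toy values are assumptions for illustration; the deployed script performs these steps with PyTorch tensors.

```python
import math


def mean_pool(token_embeddings, attention_mask):
    """Average token vectors, skipping positions where the mask is 0 (padding)."""
    dim = len(token_embeddings[0])
    summed = [0.0] * dim
    for vector, keep in zip(token_embeddings, attention_mask):
        for i in range(dim):
            summed[i] += vector[i] * keep
    # Same guard as torch.clamp(..., min=1e-9): avoids dividing by zero on an all-padding row.
    count = max(sum(attention_mask), 1e-9)
    return [value / count for value in summed]


def l2_normalize(vector, eps=1e-12):
    """Scale a vector to unit length, guarding against a zero norm."""
    norm = max(math.sqrt(sum(v * v for v in vector)), eps)
    return [v / norm for v in vector]


# Two real tokens followed by one padding token whose values must not leak into the result.
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]

pooled = mean_pool(tokens, mask)
print(pooled)  # [2.0, 3.0]
unit = l2_normalize(pooled)
```

Without the mask, the padding vector would dominate the average, which is why the attention mask is passed to `mean_pooling` in the script above.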


## Create `model.tar.gz` with inference script and model 

To use our `inference.py`, we need to bundle it into a `model.tar.gz` archive together with all our model artifacts, e.g. `pytorch_model.bin`. The `inference.py` script will be placed into a `code/` folder. We will use `git` and `git-lfs` to download our model from hf.co/models and then upload the archive to Amazon S3 so we can use it when creating our SageMaker endpoint.

In [8]:
repository = "sentence-transformers/all-MiniLM-L6-v2"
model_id=repository.split("/")[-1]
s3_location=f"s3://{sess.default_bucket()}/custom_inference/{model_id}/model.tar.gz"

1. Download the model from hf.co/models with `git clone`.

In [9]:
!git-lfs install
!git clone https://huggingface.co/$repository

Error: Failed to call git rev-parse --git-dir: exit status 128 
Git LFS initialized.
Cloning into 'all-MiniLM-L6-v2'...
remote: Enumerating objects: 46, done.
remote: Total 46 (delta 0), reused 0 (delta 0), pack-reused 46
Unpacking objects: 100% (46/46), 311.33 KiB | 270.00 KiB/s, done.
Filtering content: 100% (3/3), 260.15 MiB | 38.70 MiB/s, done.


2. Copy `inference.py` into the `code/` directory of the model directory.

In [10]:
!cp -r code/ $model_id/code/

In [11]:
!echo $model_id

all-MiniLM-L6-v2


3. Create a `model.tar.gz` archive with all the model artifacts and the `inference.py` script.


In [12]:
%cd $model_id
!tar zcvf model.tar.gz *

/root/spectrum-labs/all-MiniLM-L6-v2
1_Pooling/
1_Pooling/config.json
README.md
code/
code/inference.py
config.json
config_sentence_transformers.json
data_config.json
modules.json
pytorch_model.bin
rust_model.ot
sentence_bert_config.json
special_tokens_map.json
tf_model.h5
tokenizer.json
tokenizer_config.json
train_script.py
vocab.txt
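If you prefer to build the archive from Python rather than the shell, a sketch along the following lines should produce the same layout, with `code/inference.py` at the top level of the archive next to the model files. The helper names `build_model_archive` and `archive_members` are hypothetical and not part of the SageMaker SDK.

```python
import tarfile
from pathlib import Path


def build_model_archive(model_dir, archive_path):
    """Pack the contents of model_dir at the archive root, like `tar zcvf model.tar.gz *`."""
    model_dir = Path(model_dir)
    archive_path = Path(archive_path)
    with tarfile.open(archive_path, "w:gz") as tar:
        for entry in sorted(model_dir.iterdir()):
            # Skip hidden entries such as .git (the shell glob `*` does too) and the archive itself.
            if entry.name.startswith(".") or entry.resolve() == archive_path.resolve():
                continue
            tar.add(entry, arcname=entry.name)
    return archive_path


def archive_members(archive_path):
    """Return the member names so the layout can be checked before uploading."""
    with tarfile.open(archive_path, "r:gz") as tar:
        return tar.getnames()
```

For this guide, `build_model_archive("all-MiniLM-L6-v2", "all-MiniLM-L6-v2/model.tar.gz")` followed by a check that `"code/inference.py"` appears in `archive_members(...)` would confirm the layout before uploading.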


4. Upload the `model.tar.gz` to Amazon S3:


In [13]:
!aws s3 cp model.tar.gz $s3_location

upload: ./model.tar.gz to s3://sagemaker-us-east-2-263439870041/custom_inference/all-MiniLM-L6-v2/model.tar.gz


## Create custom `HuggingFaceModel`

After we have created and uploaded our `model.tar.gz` archive to Amazon S3, we can create a custom `HuggingFaceModel` instance. This object will be used to create and deploy our SageMaker endpoint.

In [14]:
from sagemaker.huggingface.model import HuggingFaceModel


# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
   model_data=s3_location,       # path to your model and script
   role=role,                    # iam role with permissions to create an Endpoint
   transformers_version="4.26",  # transformers version used
   pytorch_version="1.13",        # pytorch version used
   py_version='py39',            # python version used
)

# deploy the endpoint
predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g4dn.xlarge"
    )

----------!

## Request Inference Endpoint using the `HuggingFacePredictor`

The `.deploy()` method returns a `HuggingFacePredictor` object, which can be used to request inference.

In [15]:
data = {
  "inputs": "the mesmerizing performances of the leads keep the film grounded and keep the audience riveted .",
}

res = predictor.predict(data=data)
print(res)


{'vectors': [0.005078199319541454, -0.003659447655081749, 0.016988737508654594, -0.0015786292497068644, 0.0302036814391613, 0.09331895411014557, -0.023515766486525536, 0.011795221827924252, 0.03421776741743088, -0.027907807379961014, -0.03260171040892601, 0.067980095744133, 0.01522375363856554, 0.025948477908968925, -0.07854386419057846, -0.002391602611169219, 0.10089628398418427, 0.001498094410635531, -0.01777803897857666, 0.005812615621834993, 0.02445337176322937, -0.07103715091943741, 0.04755861684679985, 0.02636093646287918, -0.057162512093782425, -0.09400142729282379, 0.047948963940143585, 0.008600173518061638, 0.032970305532217026, -0.06984363496303558, -0.0552142858505249, -0.03234357014298439, -0.0003443291934672743, 0.012479405850172043, -0.07419361919164658, 0.0854540765285492, 0.019597090780735016, 0.005851503927260637, -0.08256842941045761, 0.010150201618671417, 0.028275245800614357, -0.001612170715816319, 0.04174524545669556, -0.009756682440638542, 0.03546827659010887, -0.
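Because `predict_fn` L2-normalizes the embeddings, the dot product of two returned vectors equals their cosine similarity. As a sketch of how the `vectors` field might be used for semantic search, the example below ranks a few documents against a query. The 2-dimensional toy vectors and the helper names `dot` and `rank_by_similarity` are assumptions standing in for real 384-dimensional endpoint responses.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def rank_by_similarity(query_vector, document_vectors):
    """Return (index, score) pairs sorted from most to least similar."""
    scores = [(i, dot(query_vector, doc)) for i, doc in enumerate(document_vectors)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy unit-length vectors standing in for the "vectors" field of endpoint responses.
query = [1.0, 0.0]
documents = [[0.0, 1.0], [0.6, 0.8], [1.0, 0.0]]

ranking = rank_by_similarity(query, documents)
print(ranking)  # [(2, 1.0), (1, 0.6), (0, 0.0)]
```

In practice each document vector would come from one `predictor.predict(...)` call, since the script above returns a single embedding per request.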

### Delete model and endpoint

To clean up, we can delete the model and endpoint.

In [15]:
predictor.delete_model()
predictor.delete_endpoint()