# Deploy Codestral on Amazon SageMaker with vLLM

---

[Codestral](https://mistral.ai/news/codestral/) is an open-weight generative AI model explicitly designed for code generation tasks. It helps developers write and interact with code through a shared instruction and completion API endpoint. As it masters code and English, it can be used to design advanced AI applications for software developers. Codestral is trained on a diverse dataset of 80+ programming languages, including the most popular ones, such as Python, Java, C, C++, JavaScript, and Bash. It also performs well on more specific ones like Swift and Fortran. This broad language base ensures Codestral can assist developers in various coding environments and projects.

SageMaker now offers a [vLLM container](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#large-model-inference-containers), which lets you use SageMaker's managed serving capabilities and takes care of the undifferentiated heavy lifting of running the inference stack.

In this notebook, we combine the strengths of two powerful tools: [DJL](https://docs.djl.ai/) (Deep Java Library) for the serving framework and [vLLM](https://docs.vllm.ai/en/stable/) for distributed large language model inference. DJLServing, a high-performance universal model serving solution powered by DJL, handles the overall serving architecture.

In our setup, vLLM handles the core LLM inference tasks, leveraging its optimizations to achieve high performance and low latency. DJLServing manages the broader serving infrastructure, handling incoming requests, load balancing, and coordinating with vLLM for efficient inference.

This combination allows us to deploy the `Codestral 22B` model across GPUs on the `ml.g5.12xlarge` instance with optimal resource utilization. vLLM's efficiencies in memory management and request handling enable us to serve this large model with improved throughput compared to traditional serving methods. To learn more about DJL, DJLServing, and vLLM you can refer to this [blog post](https://aws.amazon.com/blogs/machine-learning/boost-inference-performance-for-mixtral-and-llama-2-models-with-new-amazon-sagemaker-containers/).
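
As a rough, back-of-the-envelope illustration of why the model is sharded across GPUs, the sketch below estimates the memory needed for the weights alone. It assumes roughly 22 billion parameters stored in 16-bit precision (2 bytes each) and the four 24 GB NVIDIA A10G GPUs of an `ml.g5.12xlarge`. KV cache and activation memory are ignored, so treat the result as a lower bound rather than a sizing guide.

```python
# Back-of-the-envelope weight memory estimate (assumptions, not measurements).
PARAMS = 22e9          # assumed parameter count for a "22B" model
BYTES_PER_PARAM = 2    # 16-bit weights (fp16 / bf16)
NUM_GPUS = 4           # ml.g5.12xlarge: 4 x NVIDIA A10G
GPU_MEMORY_GB = 24     # memory per A10G

weights_gb = PARAMS * BYTES_PER_PARAM / 1e9
per_gpu_gb = weights_gb / NUM_GPUS
fits_on_one_gpu = weights_gb <= GPU_MEMORY_GB

print(f"Total weights: {weights_gb:.0f} GB")
print(f"Per GPU when sharded over {NUM_GPUS} GPUs: {per_gpu_gb:.0f} GB")
print(f"Fits on a single GPU: {fits_on_one_gpu}")
```

Under these assumptions the weights do not fit on one GPU, but each shard fits comfortably once the model is split four ways, leaving headroom for the KV cache.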

---

As a 22B model, Codestral sets a new standard on the performance/latency space for code generation compared to previous models used for coding. With its larger context window of 32k (compared to 4k, 8k or 16k for competitors), Codestral outperforms all other models in RepoBench, a long-range eval for code generation.

![codestral](imgs/codestral.png)

<b><i>To deploy Codestral on SageMaker with TGI, please refer to the 'Deploy Codestral on TGI' notebook located in this folder.</i></b>

---

<div class="alert alert-block alert-warning"> 

<b>NOTE:</b> Codestral is a 22B open-weight model licensed under the new Mistral AI Non-Production License, which means that you can use it for research and testing purposes. Codestral can be downloaded from Hugging Face.
If you want to use the model in the course of commercial activity, commercial licenses are also available on demand by reaching out to the Mistral team.
</div>

##### Reach out to Mistral to explore Codestral for commercial use cases: [Contact the Mistral team](https://mistral.ai/contact/)

##### More on the Mistral AI Non-Production License: [Mistral AI Non-Production License](https://mistral.ai/news/mistral-ai-non-production-license-mnpl/)

---

## Requirements

1. Create an Amazon SageMaker Notebook Instance - [Amazon SageMaker](https://docs.aws.amazon.com/sagemaker/latest/dg/gs-setup-working-env.html)
    - For Notebook Instance type, choose `ml.t3.medium`.
2. For Select Kernel, choose [conda_python3](https://docs.aws.amazon.com/sagemaker/latest/dg/ex1-prepare.html).
3. Install the required packages.

<div class="alert alert-block alert-info"> 

<b>NOTE:</b>

- For <a href="https://aws.amazon.com/sagemaker/studio/" target="_blank">Amazon SageMaker Studio</a>, select Kernel "<span style="color:green;">Python 3 (ipykernel)</span>".

- For <a href="https://docs.aws.amazon.com/sagemaker/latest/dg/studio.html" target="_blank">Amazon SageMaker Studio Classic</a>, select Image "<span style="color:green;">Base Python 3.0</span>" and Kernel "<span style="color:green;">Python 3</span>".

</div>

To run this notebook, install the following dependencies:

In [15]:
!pip install boto3==1.34.132 -qU --force --quiet --no-warn-conflicts
!pip install sagemaker==2.224.2 -qU --force --quiet --no-warn-conflicts

---

### Import libraries

In [1]:
import boto3
import json
import sagemaker

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/ec2-user/.config/sagemaker/config.yaml


In [2]:
print(sagemaker.__version__)

2.224.2


### Initialize parameters

In [4]:
# execution role for the endpoint
role = sagemaker.get_execution_role()

# sagemaker session for interacting with different AWS APIs
sess = sagemaker.session.Session()

# AWS Region of the session (public attribute instead of the private _region_name)
region_name = sess.boto_region_name

print(f"sagemaker role arn: {role}")
print(f"sagemaker session region: {region_name}")

sagemaker session region: us-east-1


### Image URI of the DJL Container

Large Model Inference (LMI) Deep Learning Containers (DLCs) offer a low-code interface that simplifies using state-of-the-art inference optimization techniques and hardware. LMI lets you apply tensor parallelism; the latest efficient attention, batching, quantization, and memory management techniques; token streaming; and much more, requiring only the model ID and optional model parameters.

In [5]:
inference_image_uri = sagemaker.image_uris.retrieve(
    framework="djl-lmi",
    region=region_name,
    version="0.28.0"
)
print(f"DLC image going to be used is ---- > {inference_image_uri}")

DLC image going to be used is ---- > 763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.28.0-lmi10.0.0-cu124


See more details about DLC images [here](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#large-model-inference-containers) and [here](https://github.com/deepjavalibrary/djl-serving/blob/master/serving/docs/lmi/announcements/deepspeed-deprecation.md).

### Available Environment Variable Configurations

Here is a list of the environment variables that we use in this configuration:

- `HF_MODEL_ID`: The model id of a pretrained model hosted in a model repository on [huggingface.co](https://huggingface.co/models). The container uses this model id to download the corresponding model repository from huggingface.co. This setting is optional and is not needed when you bring your own model; in that case, you can instead provide the URI of the Amazon S3 bucket that contains the model.
- `HF_TOKEN`: Some models on the HuggingFace Hub are gated and require permission from the owner to access. To deploy a gated model from the HuggingFace Hub using LMI, you must provide an [Access Token](https://huggingface.co/docs/hub/security-tokens) via this environment variable.
- `OPTION_ENGINE`: The engine for DJL to use. In this case, we run [vLLM](https://docs.vllm.ai/en/stable/) through DJL's Python engine and hence set it to **Python**.
- `OPTION_DTYPE`: The data type you plan to cast the model weights to. If not provided, LMI will use fp16.
- `OPTION_TASK`: The task used in Hugging Face for different pipelines. Default is text-generation. For further reading on DJL parameters on SageMaker, follow the [link](https://docs.djl.ai/docs/serving/serving/docs/lmi/user_guides/deepspeed_user_guide.html)
- `OPTION_ROLLING_BATCH`: Enables continuous batching (iteration level batching) with one of the supported backends. Available backends differ by container, see [Inference Library Configurations](https://docs.djl.ai/docs/serving/serving/docs/lmi/deployment_guide/configurations.html#inference-library-configuration) for mappings.
    - In the LMI Container:
        - to use vLLM, use `OPTION_ROLLING_BATCH=vllm`
        - to use lmi-dist, use `OPTION_ROLLING_BATCH=lmi-dist`
        - to use Hugging Face Accelerate, use `OPTION_ROLLING_BATCH=auto` for text generation models, or `OPTION_ROLLING_BATCH=disable` for non-text generation models.
- `TENSOR_PARALLEL_DEGREE`: The number of GPU devices over which the model is partitioned. This parameter also controls the number of workers per model that are started when DJL Serving runs. Setting it to `max` shards the model across all available GPUs. For example, on an 8-GPU machine with 8 partitions, there is 1 worker per model to serve requests.
- `OPTION_DEVICE_MAP`: The HuggingFace accelerate device_map to use.
- `OPTION_TRUST_REMOTE_CODE`: If the model artifacts contain custom modeling code, you should set this to true after validating the custom code is not malicious. If you are using a HuggingFace Hub model id, you should also specify HF_REVISION to ensure you are using artifacts and code that you have validated.

For more details on the configuration options and an exhaustive list, you can refer to the [documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/realtime-endpoints-large-model-configuration.html) and the [LMI Starting Guide](https://docs.djl.ai/docs/serving/serving/docs/lmi/user_guides/starting-guide.html).
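
To keep these settings in one place, the environment could be assembled by a small function, as in the sketch below. `build_lmi_env` is a hypothetical helper name, not part of the SageMaker SDK; the keys and values mirror the ones used in this notebook, and `HF_TOKEN` is only added when a token is supplied.

```python
from typing import Dict, Optional


def build_lmi_env(
    model_id: str,
    hf_token: Optional[str] = None,
    dtype: str = "bf16",
    tensor_parallel_degree: str = "max",
) -> Dict[str, str]:
    """Assemble the container environment variables described above.

    Hypothetical helper for illustration; not part of the SageMaker SDK.
    """
    env = {
        "HF_MODEL_ID": model_id,
        "OPTION_ENGINE": "Python",
        "OPTION_DTYPE": dtype,
        "OPTION_TASK": "text-generation",
        "OPTION_ROLLING_BATCH": "vllm",
        "TENSOR_PARALLEL_DEGREE": tensor_parallel_degree,
        "OPTION_DEVICE_MAP": "auto",
    }
    # Only gated models need an access token, so it is left out by default.
    if hf_token:
        env["HF_TOKEN"] = hf_token
    return env


lmi_env = build_lmi_env("mistral-community/Codestral-22B-v0.1")
print(lmi_env)
```

The resulting dictionary can be passed as the `env` argument of `sagemaker.Model`.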

## Create SageMaker endpoint

In [6]:
# Hugging Face Model Id
model_id = "mistral-community/Codestral-22B-v0.1"

# SageMaker Instance Type
instance_type = "ml.g5.12xlarge"

# Endpoint name
endpoint_name_prefix = "codestral-22b-vllm"
endpoint_name = sagemaker.utils.name_from_base(endpoint_name_prefix)

print(f"instance_type: {instance_type}")
print(f"model_id: {model_id}")
print(f"endpoint_name: {endpoint_name}")

instance_type: ml.g5.12xlarge
model_id: mistral-community/Codestral-22B-v0.1
endpoint_name: codestral-22b-vllm-2024-06-28-12-16-20-778


In [7]:
# Deploy model to an endpoint
model = sagemaker.Model(
    image_uri=inference_image_uri,
    role=role,
    env={
        "HF_MODEL_ID": model_id,
        # "HF_TOKEN": "<REPLACE WITH YOUR TOKEN>",
        "OPTION_ENGINE": "Python",
        "OPTION_DTYPE": "bf16",
        "OPTION_TASK": "text-generation",
        "OPTION_ROLLING_BATCH": "vllm",
        "TENSOR_PARALLEL_DEGREE": "max",
        "OPTION_DEVICE_MAP": "auto",
        # "OPTION_TRUST_REMOTE_CODE": "true"
    }
)

model.deploy(
    initial_instance_count=1,
    instance_type=instance_type,
    endpoint_name=endpoint_name,
    container_startup_health_check_timeout=900,
)

---------------!
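
By default, `model.deploy` blocks until the endpoint is ready. If you want to check the status separately, for example from another session, a minimal polling sketch could look like the following. It assumes a client that behaves like `boto3.client("sagemaker")`, whose `describe_endpoint` call returns an `EndpointStatus` field; `wait_until_in_service` is a hypothetical helper name.

```python
import time


def wait_until_in_service(sm_client, endpoint_name, poll_seconds=30, max_polls=60):
    """Poll describe_endpoint until the endpoint is InService or has failed.

    Hypothetical helper; `sm_client` is expected to behave like
    boto3.client("sagemaker").
    """
    for _ in range(max_polls):
        response = sm_client.describe_endpoint(EndpointName=endpoint_name)
        status = response["EndpointStatus"]
        if status == "InService":
            return status
        if status == "Failed":
            raise RuntimeError(f"Endpoint {endpoint_name} failed to deploy")
        time.sleep(poll_seconds)
    raise TimeoutError(
        f"Endpoint {endpoint_name} was not InService after {max_polls} polls"
    )
```

In this notebook it could be called as `wait_until_in_service(boto3.client("sagemaker"), endpoint_name)`.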

## Run inference and chat with the model

### Supported Inference Parameters

---
This model supports the following inference payload parameters:

* **max_new_tokens:** Model generates text until the output length (excluding the input context length) reaches max_new_tokens. If specified, it must be a positive integer.
* **temperature:** Controls the randomness in the output. Higher temperature results in output sequence with low-probability words and lower temperature results in output sequence with high-probability words. If `temperature` -> 0, it results in greedy decoding. If specified, it must be a positive float.
* **top_p:** In each step of text generation, sample from the smallest possible set of words with cumulative probability `top_p`. If specified, it must be a float between 0 and 1.
* **return_full_text:** If True, input text will be part of the output generated text. If specified, it must be boolean. The default value for it is False.

You may specify any subset of the parameters mentioned above while invoking an endpoint. 
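
The constraints above can be checked on the client side before a request is sent. The sketch below is one possible way to do that; `build_payload` is a hypothetical helper, and treating the bounds of `top_p` as inclusive is an assumption. Only the parameters you set are included in the payload.

```python
def build_payload(prompt, max_new_tokens=None, temperature=None,
                  top_p=None, return_full_text=None):
    """Build an inference payload, validating the parameters listed above.

    Hypothetical helper for illustration; unset parameters are omitted.
    """
    params = {}
    if max_new_tokens is not None:
        # bool is a subclass of int in Python, so reject it explicitly
        if isinstance(max_new_tokens, bool) or not isinstance(max_new_tokens, int) \
                or max_new_tokens <= 0:
            raise ValueError("max_new_tokens must be a positive integer")
        params["max_new_tokens"] = max_new_tokens
    if temperature is not None:
        if temperature <= 0:
            raise ValueError("temperature must be a positive float")
        params["temperature"] = float(temperature)
    if top_p is not None:
        # Bounds are treated as inclusive here (an assumption)
        if not 0.0 <= top_p <= 1.0:
            raise ValueError("top_p must be a float between 0 and 1")
        params["top_p"] = float(top_p)
    if return_full_text is not None:
        if not isinstance(return_full_text, bool):
            raise ValueError("return_full_text must be a boolean")
        params["return_full_text"] = return_full_text
    return {"inputs": prompt, "parameters": params}


payload = build_payload("Write a hello world program in Bash.",
                        max_new_tokens=256, top_p=0.95)
print(payload)
```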

---

### Sample code generation questions

1. "Create a Python class for a multi-threaded web scraper that can handle rate limiting, proxy rotation, and dynamic content loading. Include methods for parsing HTML with BeautifulSoup and storing results in a SQLite database."
2. "Implement a Red-Black Tree data structure in C++ with methods for insertion, deletion, and rebalancing. Include a visualization function that prints the tree structure to the console."
3. "Write a Rust function that implements the Aho-Corasick string matching algorithm for efficient multi-pattern searching. Optimize it for memory usage and include comprehensive error handling."
4. "Develop a JavaScript module for a real-time collaborative text editor using operational transformation. Implement functions for handling concurrent edits, conflict resolution, and syncing with a backend server."
5. "Create a Python script that uses asyncio to concurrently process large CSV files, perform complex data transformations, and upload the results to an S3 bucket. Include proper error handling and logging."
6. "Implement a microservices architecture in Go for a basic e-commerce platform. Include services for user authentication, product catalog, order processing, and inventory management. Use gRPC for inter-service communication and implement circuit breaking for resilience."
7. "Provide me with a python script to recompile huggingface models with optimum neuron for inferentia"

---

### Inference using SageMaker SDK

In [8]:
# Create a Predictor bound to the endpoint created in the prior step
predictor = sagemaker.Predictor(
    endpoint_name=endpoint_name,
    sagemaker_session=sess,
    serializer=sagemaker.serializers.JSONSerializer(),
    deserializer=sagemaker.deserializers.JSONDeserializer(),
)

In [9]:
prompt = "Create a Python class for a multi-threaded web scraper that can handle rate limiting, proxy rotation, and dynamic content loading. Include methods for parsing HTML with BeautifulSoup and storing results in a SQLite database."

inputs = {
    "inputs": prompt,
    "parameters": {
        "temperature": 0.8,
        "top_p": 0.95,
        "max_new_tokens": 4000,
        "do_sample": False
    }
}
response = predictor.predict(inputs)
print(response['generated_text'].strip())

This is a complex task that requires a good understanding of Python, web scraping, and multithreading. Here's a basic outline of how you might approach this:

1. Create a class `WebScraper` that initializes with a list of proxies and a rate limit.
2. In the class, create a method `rotate_proxy` that selects a proxy from the list and returns it.
3. Create a method `parse_html` that takes a URL as input, fetches the HTML content, and parses it using BeautifulSoup.
4. Create a method `store_results` that takes the parsed data and stores it in a SQLite database.
5. Create a method `scrape` that uses multithreading to scrape multiple URLs. This method should handle rate limiting and proxy rotation.

Here's a very basic example of how you might start this:

```python
import requests
from bs4 import BeautifulSoup
import sqlite3
import threading
import time
import random

class WebScraper:
    def __init__(self, proxies, rate_limit):
        self.proxies = proxies
        self.rate_limit = rat

### Inference using Boto3 SDK

In [10]:
# Create a SageMaker Runtime client with boto3 to invoke the endpoint created in the prior step
smr_client = boto3.client("sagemaker-runtime")

In [11]:
prompt = "Implement a Red-Black Tree data structure in C++ with methods for insertion, deletion, and rebalancing. Include a visualization function that prints the tree structure to the console."

response = smr_client.invoke_endpoint(
    EndpointName=endpoint_name,
    Body=json.dumps(
        {
            "inputs": prompt,
            "parameters": {
                "temperature": 0.8,
                "top_p": 0.95,
                "max_new_tokens": 4000,
                "do_sample": False
            },
        }
    ),
    ContentType="application/json",
)["Body"].read().decode("utf8")

print(json.loads(response)['generated_text'])



Here's a basic outline of the Red-Black Tree implementation:

1. Define the Node structure with color, key, left, right, and parent pointers.
2. Implement the Red-Black Tree class with a root pointer and methods for insertion, deletion, and rebalancing.
3. Implement the insertion method to insert a new node into the tree while maintaining the Red-Black Tree properties.
4. Implement the deletion method to delete a node from the tree while maintaining the Red-Black Tree properties.
5. Implement the rebalancing methods to fix any violations of the Red-Black Tree properties after insertion or deletion.
6. Implement the visualization function to print the tree structure to the console.

Here's an example implementation of the Red-Black Tree data structure in C++:

```cpp
#include <iostream>

enum Color { RED, BLACK };

struct Node {
    int key;
    Color color;
    Node* left;
    Node* right;
    Node* parent;

    Node(int key) : key(key), color(RED), left(nullptr), right(nullptr), par

## Conclusion
In this notebook, we demonstrated how to use SageMaker large model inference containers to host Codestral 22B. We used vLLM with tensor parallelism across the multiple GPUs of a single SageMaker machine learning instance.

## Clean Up

In [15]:
# Delete the endpoint
sess.delete_endpoint(endpoint_name)

In [16]:
# Even if the endpoint failed to deploy, we still want to delete the endpoint configuration and the model
sess.delete_endpoint_config(endpoint_name)
model.delete_model()
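
If one of the delete calls fails (for example, because the endpoint never came up), the remaining resources should still be removed. A minimal sketch of that idea, assuming a hypothetical helper `run_cleanup` that takes labeled zero-argument callables, could look like this:

```python
def run_cleanup(steps):
    """Run every cleanup step, collecting failures instead of stopping early.

    `steps` is a list of (label, zero-argument callable) pairs.
    Hypothetical helper for illustration.
    """
    failures = []
    for label, step in steps:
        try:
            step()
        except Exception as exc:  # keep going so later resources are still removed
            failures.append((label, exc))
    return failures


# Possible wiring with the objects from this notebook (left as comments so
# the sketch runs on its own):
# failures = run_cleanup([
#     ("endpoint", lambda: sess.delete_endpoint(endpoint_name)),
#     ("endpoint config", lambda: sess.delete_endpoint_config(endpoint_name)),
#     ("model", model.delete_model),
# ])
# for label, exc in failures:
#     print(f"Could not delete {label}: {exc}")
```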