# Maximizing Online Inference: Exploring Ray Serve and Diverse Methodologies

Ray Serve is an adaptable model deployment framework designed for constructing real-time inference APIs. It's framework-agnostic, allowing you to use a single toolkit to serve a wide range of tools, models, and services.

![Serve Positioning](https://raw.githubusercontent.com/maxpumperla/learning_ray/main/notebooks/images/chapter_08/serve_positioning.png)

![Serve Architecture](https://raw.githubusercontent.com/maxpumperla/learning_ray/main/notebooks/images/chapter_08/serve_arch.png)

# Serving an Inference API with Ray Serve and FastAPI

In [1]:
# Run this in a separate process to avoid any blocking:
! serve run --non-blocking check:app --working-dir="./" --host "0.0.0.0" --port 8000

2023-10-29 14:56:41,666	INFO scripts.py:407 -- Running import path: 'check:app'.
2023-10-29 14:56:42,469	INFO client_builder.py:237 -- Passing the following kwargs to ray.init() on the server: ignore_reinit_error
The new client HTTP config differs from the existing one in the following fields: ['location']. The new HTTP config is ignored.
2023-10-29 14:56:55,097	INFO router.py:853 -- Using PowerOfTwoChoicesReplicaScheduler.
2023-10-29 14:56:55,106	INFO router.py:329 -- Got updated replicas for deployment default_RAYFastAPIDeployment: {'default_RAYFastAPIDeployment#WOWWhw', 'default_RAYFastAPIDeployment#PstZJC'}.
2023-10-29 14:56:55,110	SUCC scripts.py:448 -- Deployed Serve app successfully.

![NLP API Architecture](https://raw.githubusercontent.com/maxpumperla/learning_ray/main/notebooks/images/chapter_08/nlp_api_arch.png)

In [2]:
input_text = '''
The Blob has long captivated my imagination, emerging as the quintessential cinematic nightmare:
an insatiable, amoebic entity with the eerie ability to breach virtually any defense, 
ominously described by a fated scientist as  assimilating flesh on contact. 
Mocking parallels to gelatin are futile for this concept embodies the gravest of implications, akin to the cataclysmic 
gray goo scenario envisioned by technophiles haunted by the specter of runaway artificial intelligence '''

In [3]:
import requests

# Send the request once and reuse the response object for all three prints.
response = requests.get("http://10.224.1.8:8000/check", params={"input_text": input_text})
print(response.__dict__)                # full response state, for inspection
print(response.status_code)             # HTTP status code
print(response.content.decode("utf8"))  # response body as text

{'_content': b'[-0.18333333333333335, -0.6]', '_content_consumed': True, '_next': None, 'status_code': 200, 'headers': {'date': 'Sun, 29 Oct 2023 14:56:48 GMT', 'server': 'uvicorn', 'content-type': 'application/json', 'ray_serve_request_id': 'bhhwtNlXRi', 'Transfer-Encoding': 'chunked'}, 'raw': <urllib3.response.HTTPResponse object at 0x7fbc648647c0>, 'url': 'http://10.224.1.8:8000/check?input_text=%0AThe+Blob+has+long+captivated+my+imagination%2C+emerging+as+the+quintessential+cinematic+nightmare%3A%0Aan+insatiable%2C+amoebic+entity+with+the+eerie+ability+to+breach+virtually+any+defense%2C+%0Aominously+described+by+a+fated+scientist+as++assimilating+flesh+on+contact.+%0AMocking+parallels+to+gelatin+are+futile+for+this+concept+embodies+the+gravest+of+implications%2C+akin+to+the+cataclysmic+%0Agray+goo+scenario+envisioned+by+technophiles+haunted+by+the+specter+of+runaway+artificial+intelligence+', 'encoding': 'utf-8', 'history': [], 'reason': 'OK', 'cookies': <RequestsCookieJar[]>, 'ela

In [7]:
input_text = '''Greetings, World! The Desired Request Handling: Executed as Anticipated!'''

In [8]:
import requests
print(requests.get("http://10.224.1.8:8000/check", params={"input_text": input_text}).__dict__)
print(requests.get("http://10.224.1.8:8000/check", params={"input_text": input_text}).__dict__.get('status_code'))
print(requests.get("http://10.224.1.8:8000/check", params={"input_text": input_text}).__dict__.get('_content').decode('utf8'))

{'_content': b'', '_content_consumed': True, '_next': None, 'status_code': 200, 'headers': {'date': 'Sun, 29 Oct 2023 14:57:25 GMT', 'server': 'uvicorn', 'content-type': 'text/plain', 'ray_serve_request_id': 'GoVcnGRvZG', 'Transfer-Encoding': 'chunked'}, 'raw': <urllib3.response.HTTPResponse object at 0x7fbc6486eaf0>, 'url': 'http://10.224.1.8:8000/check?input_text=Greetings%2C+World%21+The+Desired+Request+Handling%3A+Executed+as+Anticipated%21', 'encoding': 'ISO-8859-1', 'history': [], 'reason': 'OK', 'cookies': <RequestsCookieJar[]>, 'elapsed': datetime.timedelta(microseconds=19417), 'request': <PreparedRequest [GET]>, 'connection': <requests.adapters.HTTPAdapter object at 0x7fbc64028c70>}
200
[0.0, 0.0]
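
In the output above, the first of the three requests came back with an empty body and `content-type: text/plain`, while the last one returned the JSON list `[0.0, 0.0]`. The notebook does not explain the difference. A client that has to tolerate both cases could parse the body defensively, as in this small sketch; `parse_scores` is a hypothetical helper, not part of Ray Serve or `requests`.

```python
import json
from typing import List, Optional


def parse_scores(body: bytes) -> Optional[List[float]]:
    """Decode a /check response body into a list of floats, or None if the body is empty."""
    text = body.decode("utf8").strip()
    if not text:
        # Empty body, as in the first response shown above.
        return None
    return [float(value) for value in json.loads(text)]


print(parse_scores(b"[-0.18333333333333335, -0.6]"))  # body of the first example request
print(parse_scores(b""))                               # empty body
```

With a `requests` response object, the same helper would be called as `parse_scores(response.content)`.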


In [9]:
import requests

# Send the request once and reuse the response object for all three prints.
response = requests.get("http://10.224.1.8:8000/check", params={"input_text": input_text})
print(response.__dict__)                # full response state, for inspection
print(response.status_code)             # HTTP status code
print(response.content.decode("utf8"))  # response body as text

{'_content': b'[0.0, 0.0]', '_content_consumed': True, '_next': None, 'status_code': 200, 'headers': {'date': 'Sun, 29 Oct 2023 14:57:27 GMT', 'server': 'uvicorn', 'content-type': 'application/json', 'ray_serve_request_id': 'JJHsrKwHaQ', 'Transfer-Encoding': 'chunked'}, 'raw': <urllib3.response.HTTPResponse object at 0x7fbc6486e130>, 'url': 'http://10.224.1.8:8000/check?input_text=Greetings%2C+World%21+The+Desired+Request+Handling%3A+Executed+as+Anticipated%21', 'encoding': 'utf-8', 'history': [], 'reason': 'OK', 'cookies': <RequestsCookieJar[]>, 'elapsed': datetime.timedelta(microseconds=18812), 'request': <PreparedRequest [GET]>, 'connection': <requests.adapters.HTTPAdapter object at 0x7fbc6486e8b0>}
200
[0.0, 0.0]
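
The `url` field in the output above shows how the `params` dictionary ends up in the query string: spaces become `+` and punctuation is percent-encoded. The sketch below reproduces that encoding with the standard library's `urlencode`, using the same `input_text` and endpoint as the cells above, to make the transformation explicit.

```python
from urllib.parse import urlencode

input_text = "Greetings, World! The Desired Request Handling: Executed as Anticipated!"

# Form-style encoding: spaces become '+', and ',', '!' and ':' are percent-encoded.
query = urlencode({"input_text": input_text})
url = f"http://10.224.1.8:8000/check?{query}"
print(url)
```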


# Using Ray Serve with RayServeSyncHandle

In [8]:
# Run this in a separate process to avoid any blocking:
! serve run --non-blocking snycheck:app --working-dir="./"

2023-10-29 15:13:12,948	INFO scripts.py:407 -- Running import path: 'snycheck:app'.
2023-10-29 15:13:13,673	INFO client_builder.py:237 -- Passing the following kwargs to ray.init() on the server: ignore_reinit_error
The new client HTTP config differs from the existing one in the following fields: ['location']. The new HTTP config is ignored.
2023-10-29 15:13:26,256	INFO router.py:853 -- Using PowerOfTwoChoicesReplicaScheduler.
2023-10-29 15:13:26,266	INFO router.py:329 -- Got updated replicas for deployment default_RAYFastAPIDeployment: {'default_RAYFastAPIDeployment#DRDeIb', 'default_RAYFastAPIDeployment#LytUUR'}.
2023-10-29 15:13:27,619	ERROR dataclient.py:312 -- Callback error:
Traceback (most recent call last):
  File "/opt/conda/envs/ray/lib/python3.8/site-packages/ray/util/client/dataclient.py", line 301, in _process_response
    can_remove = callback(response)
  File "/opt/conda/envs/ray/lib/python3.8/site-packages/ray/util/client/dataclient.py", line 179, in __call__
    self.c

![NLP API Architecture](https://raw.githubusercontent.com/maxpumperla/learning_ray/main/notebooks/images/chapter_08/nlp_api_arch.png)

In [10]:
input_text = '''
The Blob has long captivated my imagination, emerging as the quintessential cinematic nightmare:
an insatiable, amoebic entity with the eerie ability to breach virtually any defense, 
ominously described by a fated scientist as  assimilating flesh on contact. 
Mocking parallels to gelatin are futile for this concept embodies the gravest of implications, akin to the cataclysmic 
gray goo scenario envisioned by technophiles haunted by the specter of runaway artificial intelligence '''

In [11]:
import requests

# Send the request once and reuse the response object for all three prints.
response = requests.get("http://10.224.1.8:8000/check", params={"input_text": input_text})
print(response.__dict__)                # full response state, for inspection
print(response.status_code)             # HTTP status code
print(response.content.decode("utf8"))  # response body as text

{'_content': b'[-0.18333333333333335, -0.6]', '_content_consumed': True, '_next': None, 'status_code': 200, 'headers': {'date': 'Sun, 29 Oct 2023 15:13:41 GMT', 'server': 'uvicorn', 'content-type': 'application/json', 'ray_serve_request_id': 'gihfIuAZga', 'Transfer-Encoding': 'chunked'}, 'raw': <urllib3.response.HTTPResponse object at 0x7f30b820ab50>, 'url': 'http://10.224.1.8:8000/check?input_text=%0AThe+Blob+has+long+captivated+my+imagination%2C+emerging+as+the+quintessential+cinematic+nightmare%3A%0Aan+insatiable%2C+amoebic+entity+with+the+eerie+ability+to+breach+virtually+any+defense%2C+%0Aominously+described+by+a+fated+scientist+as++assimilating+flesh+on+contact.+%0AMocking+parallels+to+gelatin+are+futile+for+this+concept+embodies+the+gravest+of+implications%2C+akin+to+the+cataclysmic+%0Agray+goo+scenario+envisioned+by+technophiles+haunted+by+the+specter+of+runaway+artificial+intelligence+', 'encoding': 'utf-8', 'history': [], 'reason': 'OK', 'cookies': <RequestsCookieJar[]>, 'ela

## Conclusion

Ray Serve is an adaptable model deployment framework for building real-time inference APIs. It is framework-agnostic, so a single toolkit can serve a wide range of models, including deep learning models built with PyTorch, TensorFlow, and Keras, as well as Scikit-Learn models and custom Python business logic. It also provides features and performance optimizations such as response streaming, dynamic request batching, and multi-node/multi-GPU serving, which make it well suited to Large Language Models and other demanding workloads.

What sets Ray Serve apart is its ability to compose multiple machine learning models and business logic components within a single Python-based inference service. Because it is built on Ray, it scales across multiple machines and offers flexible scheduling, including fractional GPU allocation, which optimizes resource sharing and enables cost-effective deployment of many machine learning models.