# Maximizing Online Inference: Exploring Ray Serve and Diverse Methodologies

Ray Serve is an adaptable model deployment framework designed for constructing real-time inference APIs. It's framework-agnostic, allowing you to use a single toolkit to serve a wide range of tools, models, and services.

![Serve Positioning](https://raw.githubusercontent.com/maxpumperla/learning_ray/main/notebooks/images/chapter_08/serve_positioning.png)

# Navigating Ray Serve through RayServeSyncHandle's Methodological Approach

In [1]:
import time

import ray
from ray.job_submission import JobSubmissionClient

# Ray cluster information
ray_head_ip = "kuberay-head-svc.kuberay.svc.cluster.local"
ray_head_port = 8265
ray_address = f"http://{ray_head_ip}:{ray_head_port}"

# Submit Ray job using JobSubmissionClient
client = JobSubmissionClient(ray_address)
job_id = client.submit_job(
    entrypoint="python snycheck.py",
    runtime_env={
        "working_dir": "./",
        "pip": ["textblob"],
        "env_vars": {"http_proxy":"http://<proxy_host>:<port>","https_proxy":"http://<proxy_host>:<port>"}
    },
    entrypoint_num_cpus=1
)

print(client.__dict__)
print(f"Ray job submitted with job_id: {job_id}")

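# Give the job a moment to register. One second is usually too short for it
# to leave the PENDING state, as the status printed below shows.
time.sleep(1)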

job_status = client.get_job_status(job_id)
get_job_logs = client.get_job_logs(job_id)
get_job_info = client.get_job_info(job_id)
print(f"Ray job status for job_id {job_id}: {job_status}")
print(f"Ray job logs for job_id {job_id}: {get_job_logs}")
print(f"Ray job info for job_id {job_id}: {get_job_info}")
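# Stream the job logs as they arrive. Top-level `async for` works in Jupyter/IPython;
# in a plain script, wrap this loop in an async function and run it with asyncio.run().
async for lines in client.tail_job_logs(job_id):
    print(lines, end="")

# ray.init() was never called, so this is a harmless no-op: JobSubmissionClient
# talks to the cluster over HTTP and needs no explicit disconnect.
ray.shutdown()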

2024-01-12 22:40:18,353	INFO dashboard_sdk.py:338 -- Uploading package gcs://_ray_pkg_d1bbb4a7b4b1de39.zip.
2024-01-12 22:40:18,355	INFO packaging.py:518 -- Creating a file package for local directory './'.


{'_client_ray_version': '2.7.0', '_address': 'http://kuberay-head-svc.kuberay.svc.cluster.local:8265', '_cookies': None, '_default_metadata': {}, '_headers': None, '_verify': True, '_ssl_context': None}
Ray job submitted with job_id: raysubmit_QGnr4Qj7Q93Fuzzd
Ray job status for job_id raysubmit_QGnr4Qj7Q93Fuzzd: PENDING
Ray job logs for job_id raysubmit_QGnr4Qj7Q93Fuzzd: 
Ray job info for job_id raysubmit_QGnr4Qj7Q93Fuzzd: type=<JobType.SUBMISSION: 'SUBMISSION'> job_id=None submission_id='raysubmit_QGnr4Qj7Q93Fuzzd' driver_info=None status=<JobStatus.PENDING: 'PENDING'> entrypoint='python snycheck.py' message='Job has not started yet. It may be waiting for resources (CPUs, GPUs, custom resources) to become available. It may be waiting for the runtime environment to be set up.' error_type=None start_time=1705099226532 end_time=None metadata={} runtime_env={'working_dir': 'gcs://_ray_pkg_d1bbb4a7b4b1de39.zip', 'pip': {'packages': ['textblob'], 'pip_check': False}, 'env_vars': {'http_pro
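
The output above shows the job still in `PENDING` one second after submission, which is expected while the runtime environment is being set up. A polling loop is usually more reliable than a fixed sleep. The following is a minimal sketch under stated assumptions: `wait_for_status` is a hypothetical helper (not part of Ray), and the timeout and polling interval are illustrative defaults. It takes any zero-argument callable, so it can be exercised without a live cluster.

```python
import time
from typing import Callable, Iterable

# Terminal states reported by the Ray Jobs API.
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "STOPPED"}


def wait_for_status(
    get_status: Callable[[], object],
    targets: Iterable[str] = ("RUNNING",),
    timeout_s: float = 300.0,
    poll_interval_s: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll get_status() until it returns a target or terminal state.

    Hypothetical helper for illustration; not part of the Ray API.
    """
    wanted = set(targets) | TERMINAL_STATES
    deadline = time.monotonic() + timeout_s
    while True:
        status = get_status()
        # JobStatus is an enum; plain strings are accepted as well.
        name = str(getattr(status, "value", status))
        if name in wanted:
            return name
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job still {name} after {timeout_s} s")
        sleep(poll_interval_s)


# Demo with a stand-in status source instead of a live cluster.
fake_statuses = iter(["PENDING", "PENDING", "RUNNING"])
final_status = wait_for_status(lambda: next(fake_statuses), sleep=lambda _: None)
print(final_status)
```

Against the cluster used in this notebook, the call would look like `wait_for_status(lambda: client.get_job_status(job_id))`.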

![NLP API Architecture](https://raw.githubusercontent.com/maxpumperla/learning_ray/main/notebooks/images/chapter_08/nlp_api_arch.png)

# The following code reads the logs and extracts the host node IP and port

In [None]:
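import re

# `lines` only holds the last chunk streamed by tail_job_logs,
# so search the full job log instead.
logs = client.get_job_logs(job_id)

# Capture the IPv4 address and the port that follow "http_proxy" in the logs
pattern = r"http_proxy (\d+\.\d+\.\d+\.\d+).*?port (\d+)"
matches = re.findall(pattern, logs)

# Display the extracted values
for match in matches:
    print(f"http_proxy: {match[0]}, port: {match[1]}")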
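
The extraction step can be tried in isolation before pointing it at real job logs. The sketch below uses the same regular expression with named groups and guards against the case where nothing matches. The sample log text and the `find_endpoints` helper are assumptions made for illustration; the exact wording of the real job log may differ.

```python
import re

# Hypothetical log text; the wording of the real job log may differ.
sample_logs = (
    "INFO starting http_proxy 10.0.0.12 on port 9000\n"
    "INFO deployment ready\n"
)

# Same pattern as in the notebook, with named groups for readability.
PROXY_PATTERN = re.compile(
    r"http_proxy (?P<host>\d+\.\d+\.\d+\.\d+).*?port (?P<port>\d+)"
)


def find_endpoints(log_text):
    """Return every (host, port) pair found in the log text."""
    return [
        (m.group("host"), int(m.group("port")))
        for m in PROXY_PATTERN.finditer(log_text)
    ]


endpoints = find_endpoints(sample_logs)
if endpoints:
    host, port = endpoints[-1]  # use the most recent entry
    print(f"http_proxy: {host}, port: {port}")
else:
    print("No http_proxy entry found in the logs yet")
```

Checking for an empty result matters here because the regex silently returns nothing when the job has not yet logged its proxy address.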

In [4]:
input_text = '''
The Blob has long captivated my imagination, emerging as the quintessential cinematic nightmare:
an insatiable, amoebic entity with the eerie ability to breach virtually any defense, 
ominously described by a fated scientist as  assimilating flesh on contact. 
Mocking parallels to gelatin are futile for this concept embodies the gravest of implications, akin to the cataclysmic 
gray goo scenario envisioned by technophiles haunted by the specter of runaway artificial intelligence '''

In [5]:
import requests

# `match` holds the last (host, port) pair extracted from the logs above.
# Send the request once and reuse the response instead of calling the API three times.
response = requests.get(f"http://{match[0]}:{match[1]}/check", params={"input_text": input_text})
print(response.__dict__)
print(response.status_code)
print(response.content.decode('utf8'))

{'_content': b'', '_content_consumed': True, '_next': None, 'status_code': 200, 'headers': {'date': 'Fri, 12 Jan 2024 22:44:04 GMT', 'server': 'envoy', 'content-type': 'text/plain', 'ray_serve_request_id': '3a9dc65a-ce3e-4ba8-a62b-da9b1fbcba03', 'x-request-id': '3a9dc65a-ce3e-4ba8-a62b-da9b1fbcba03', 'x-envoy-upstream-service-time': '5324', 'transfer-encoding': 'chunked'}, 'raw': <urllib3.response.HTTPResponse object at 0x7f27f48a4d60>, 'url': 'http://10.224.105.197:9000/check?input_text=%0AThe+Blob+has+long+captivated+my+imagination%2C+emerging+as+the+quintessential+cinematic+nightmare%3A%0Aan+insatiable%2C+amoebic+entity+with+the+eerie+ability+to+breach+virtually+any+defense%2C+%0Aominously+described+by+a+fated+scientist+as++assimilating+flesh+on+contact.+%0AMocking+parallels+to+gelatin+are+futile+for+this+concept+embodies+the+gravest+of+implications%2C+akin+to+the+cataclysmic+%0Agray+goo+scenario+envisioned+by+technophiles+haunted+by+the+specter+of+runaway+artificial+intelligence+',
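
The `url` field in the output above shows how the `params` dictionary ends up in the query string: the newline becomes `%0A`, spaces become `+`, and commas become `%2C`. The standard-library sketch below reproduces that encoding on a shortened input. It is only an illustration of the encoding, not a replacement for the `requests` call.

```python
from urllib.parse import parse_qs, urlencode

# Shortened version of the notebook's input text.
params = {"input_text": "\nThe Blob has long captivated my imagination,"}

# Form-style encoding, as seen in the logged request URL.
query = urlencode(params)
print(query)

# Decoding the query string recovers the original text unchanged.
decoded = parse_qs(query)["input_text"][0]
print(decoded == params["input_text"])
```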

## Conclusion

Ray Serve is an adaptable model deployment framework designed for constructing real-time inference APIs. It's framework-agnostic, allowing you to use a single toolkit to serve a wide range of models, including deep learning models created with popular frameworks like PyTorch, TensorFlow, and Keras, as well as Scikit-Learn models and custom Python business logic. It offers features and performance enhancements such as response streaming, dynamic request batching, and multi-node/multi-GPU support, making it well-suited for handling Large Language Models and other demanding tasks.

What sets Ray Serve apart is its proficiency in orchestrating the composition of multiple machine learning models and business logic components within a single Python-based inference service. Built on Ray, it scales across multiple machines and offers flexible scheduling capabilities, including fractional GPU allocation, which optimizes resource sharing and enables cost-effective deployment of many machine learning models.