# Runhouse

[Runhouse](https://github.com/run-house/runhouse) lets you work with remote compute and data across environments and users. See the [Runhouse docs](https://runhouse-docs.readthedocs-hosted.com/en/latest/).

This example goes over how to use LangChain and [Runhouse](https://github.com/run-house/runhouse) to interact with models hosted on your own GPU, or on on-demand GPUs on AWS, GCP, Azure, or Lambda.

**Note**: The code uses the `SelfHosted` name instead of `Runhouse`.

In [14]:
%pip install --upgrade --quiet git+https://github.com/run-house/runhouse.git@sb/fixes_langchain_integration#egg=runhouse
%pip install --upgrade --quiet "skypilot[aws]"

Note: you may need to restart the kernel to use updated packages.


In [1]:
import runhouse as rh
from langchain.chains import LLMChain
from langchain.prompts import PromptTemplate
from langchain_community.llms import SelfHostedHuggingFaceLLM, SelfHostedPipeline
from langchain_community.llms.self_hosted_hugging_face import _generate_text, _load_transformer

In [5]:
!runhouse status "/sashab/sasha-ondemand-cluster"

INFO | 2024-03-11 14:19:59.384189 | Connected (version 2.0, client OpenSSH_8.2p1)
INFO | 2024-03-11 14:20:00.212407 | Authentication (publickey) successful!
INFO | 2024-03-11 14:20:01.223769 | Server sasha-ondemand-cluster is up.
😈 Runhouse Daemon is running 🏃
/sashab/sasha-ondemand-cluster
• server_port: 32300
• server_connection_type: ssh
• den_auth: False
• backend config:
        • resource_type: cluster
        • resource_subtype: OnDemandCluster
        • provenance: None
        • visibility: private
        • ips: ['34.205.76.106']
        • use_local_telemetry: False
        • ssh_port: 22
        • instance_type: None
        • num_instances: None
        • provider: cheapest
        • autostop_mins: -1
        • open_ports: []
        • use_spot: False
        • image_id: None
        • region: [3

In [2]:
# For an on-demand A100 with GCP, Azure, or Lambda
gpu = rh.cluster(name="rh-a10x", instance_type="A100:1", use_spot=False)

# For an on-demand A10G with AWS (no single A100s on AWS)
# gpu = rh.cluster(name='rh-a10x', instance_type='g5.2xlarge', provider='aws')

# For an existing cluster
# gpu = rh.cluster(ips=['<ip of the cluster>'],
#                  ssh_creds={'ssh_user': '...', 'ssh_private_key':'<path_to_key>'},
#                  name='rh-a10x')

In [3]:
gpu.up()

I 03-11 16:10:32 optimizer.py:1206] No resource satisfying <Cloud>({'A100': 1}) on AWS.
I 03-11 16:10:32 optimizer.py:1210] Did you mean: ['A100-80GB:8', 'A100:8']


ResourcesUnavailableError: Catalog does not contain any instances satisfying the request:
Task(run=<empty>)
  resources: <Cloud>({'A100': 1}).

To fix: relax or change the resource requirements.
Try one of these offered accelerators: ['A100-80GB:8', 'A100:8']

Hint: sky show-gpus to list available accelerators.
      sky check to check the enabled clouds.

In [17]:
template = """Question: {question}

Answer: Let's think step by step."""

prompt = PromptTemplate.from_template(template)
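To see what the chain will eventually send to the model, the template can be filled in by hand. The sketch below uses plain `str.format` as a stand-in for the prompt template (an assumption for illustration: for a single `{question}` placeholder the substitution is the same), and `render_prompt` is a hypothetical helper, not part of LangChain.

```python
# Stand-in for the prompt template: str.format fills the same {question} slot.
template = """Question: {question}

Answer: Let's think step by step."""


def render_prompt(question: str) -> str:
    """Hypothetical helper: substitute the question into the template."""
    return template.format(question=question)


rendered = render_prompt("What is the capital of Germany?")
print(rendered)
```

The trailing "Let's think step by step." is part of the prompt itself, so the model's completion continues from that sentence.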

In [18]:
model_env = rh.env(
    reqs=["transformers", "torch", "langchain"],
    secrets=["huggingface"],  # needed to download google/gemma-2b-it
)

In [19]:
load_transformer_remote = rh.function(fn=_load_transformer).to(system=gpu, env=model_env)

I 03-11 16:08:02 optimizer.py:1206] No resource satisfying <Cloud>({'A100': 1}) on AWS.
I 03-11 16:08:02 optimizer.py:1210] Did you mean: ['A100-80GB:8', 'A100:8']


ResourcesUnavailableError: Catalog does not contain any instances satisfying the request:
Task(run=<empty>)
  resources: <Cloud>({'A100': 1}).

To fix: relax or change the resource requirements.
Try one of these offered accelerators: ['A100-80GB:8', 'A100:8']

Hint: sky show-gpus to list available accelerators.
      sky check to check the enabled clouds.

In [6]:
generate_text_remote = rh.function(_generate_text).to(system=gpu, env=model_env)

INFO | 2024-03-11 09:09:33.018720 | Copying package from file:///Users/sashabelousovrh/PycharmProjects/LangchainIntegration/langchain to: rh-a10x
INFO | 2024-03-11 09:09:35.288750 | Calling base_env.install
INFO | 2024-03-11 09:09:36.599561 | Time to call base_env.install: 1.31 seconds



INFO | 2024-03-11 09:09:40.330158 | Sending module _generate_text to rh-a10x



In [7]:
# model_id uses the full Hugging Face Hub ID referenced in the env comment above
self_hosted_llm = SelfHostedHuggingFaceLLM(
    name="gemma-2b-it",
    model_id="google/gemma-2b-it",
    model_load_fn=load_transformer_remote,
    inference_fn=generate_text_remote,
).to(system=gpu, env=model_env)

INFO | 2024-03-11 09:09:47.353149 | Calling _load_transformer.call


No module named 'langchain_core'
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/runhouse/servers/env_servlet.py", line 38, in wrapper
    output = func(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/runhouse/servers/env_servlet.py", line 113, in call_local
    return obj_store.call_local(
  File "/opt/conda/lib/python3.10/site-packages/runhouse/servers/obj_store.py", line 951, in call_local
    res = method(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/runhouse/resources/functions/function.py", line 114, in call
    fn = self._get_obj_from_pointers(*self.fn_pointers)
  File "/opt/conda/lib/python3.10/site-packages/runhouse/resources/module.py", line 321, in _get_obj_from_pointers
    obj_store.imported_modules[module_name] = importlib.import_module(
  File "/opt/conda/lib/python3.10/importlib/__init__.p

ERROR | 2024-03-11 09:09:48.706122 | Error calling call on _load_transformer on server: gAWVZwAAAAAAAACMCGJ1aWx0aW5zlIwTTW9kdWxlTm90Rm91bmRFcnJvcpSTlIwgTm8gbW9kdWxl
IG5hbWVkICdsYW5nY2hhaW5fY29yZSeUhZRSlH2UjARuYW1llIwObGFuZ2NoYWluX2NvcmWUc2Iu
ERROR | 2024-03-11 09:09:48.885432 | Traceback: gAWVcQcAAAAAAABYagcAAFRyYWNlYmFjayAobW9zdCByZWNlbnQgY2FsbCBsYXN0KToKICBGaWxl
ICIvb3B0L2NvbmRhL2xpYi9weXRob24zLjEwL3NpdGUtcGFja2FnZXMvcnVuaG91c2Uvc2VydmVy
cy9lbnZfc2VydmxldC5weSIsIGxpbmUgMzgsIGluIHdyYXBwZXIKICAgIG91dHB1dCA9IGZ1bmMo
KmFyZ3MsICoqa3dhcmdzKQogIEZpbGUgIi9vcHQvY29uZGEvbGliL3B5dGhvbjMuMTAvc2l0ZS1w
YWNrYWdlcy9ydW5ob3VzZS9zZXJ2ZXJzL2Vudl9zZXJ2bGV0LnB5IiwgbGluZSAxMTMsIGluIGNh
bGxfbG9jYWwKICAgIHJldHVybiBvYmpfc3RvcmUuY2FsbF9sb2NhbCgKICBGaWxlICIvb3B0L2Nv
bmRhL2xpYi9weXRob24zLjEwL3NpdGUtcGFja2FnZXMvcnVuaG91c2Uvc2VydmVycy9vYmpfc3Rv
cmUucHkiLCBsaW5lIDk1MSwgaW4gY2FsbF9sb2NhbAogICAgcmVzID0gbWV0aG9kKCphcmdzLCAq
Kmt3YXJncykKICBGaWxlICIvb3B0L2NvbmRhL2xpYi9weXRob24zLjEwL3NpdGUtcGFja2FnZXMv
cn

TypeError: exceptions must derive from BaseException
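The opaque payload on the first `ERROR` line above looks like base64-encoded pickle data (an assumption based on its `gAWV` prefix, which is how a protocol 5 pickle header encodes). The sketch below decodes it and scans for readable text instead of calling `pickle.loads`, since unpickling data copied from a log can execute arbitrary code. `printable_runs` is a hypothetical helper, similar in spirit to the Unix `strings` tool.

```python
import base64

# Payload copied from the first ERROR line above (two wrapped base64 lines).
payload = (
    "gAWVZwAAAAAAAACMCGJ1aWx0aW5zlIwTTW9kdWxlTm90Rm91bmRFcnJvcpSTlIwgTm8gbW9kdWxl"
    "IG5hbWVkICdsYW5nY2hhaW5fY29yZSeUhZRSlH2UjARuYW1llIwObGFuZ2NoYWluX2NvcmWUc2Iu"
)

raw = base64.b64decode(payload)


def printable_runs(data: bytes, min_len: int = 4) -> list:
    """Hypothetical helper: collect runs of printable ASCII of at least min_len chars."""
    runs, current = [], []
    for byte in data:
        if 32 <= byte < 127:
            current.append(chr(byte))
        else:
            if len(current) >= min_len:
                runs.append("".join(current))
            current = []
    if len(current) >= min_len:
        runs.append("".join(current))
    return runs


runs = printable_runs(raw)
for run in runs:
    print(run)
```

The readable fragments name the exception type and the missing module, which matches the plain-text message at the top of the remote traceback.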

In [None]:
llm_chain = LLMChain(prompt=prompt, llm=self_hosted_llm)

In [None]:
question = "What NFL team won the Super Bowl in the year Justin Bieber was born?"

llm_chain.run(question)

You can also load more custom models through the SelfHostedHuggingFaceLLM interface:

In [None]:
llm = SelfHostedHuggingFaceLLM(
    model_id="google/flan-t5-small",
    task="text2text-generation",
    hardware=gpu,
)

In [None]:
llm("What is the capital of Germany?")

Using a custom load function, we can load a custom pipeline directly on the remote hardware:

In [None]:
def load_pipeline():
    from transformers import (
        AutoModelForCausalLM,
        AutoTokenizer,
        pipeline,
    )

    model_id = "gpt2"
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)
    pipe = pipeline(
        "text-generation", model=model, tokenizer=tokenizer, max_new_tokens=10
    )
    return pipe


def inference_fn(pipeline, prompt, stop=None):
    return pipeline(prompt)[0]["generated_text"][len(prompt) :]
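The slice in `inference_fn` is there because a text-generation pipeline returns the prompt followed by the completion by default, so the prompt has to be cut off the front. The sketch below shows this with `fake_pipeline`, a hypothetical stand-in that needs no model download. It also sketches one way the otherwise unused `stop` argument could be honored; `truncate_at_stop` is an illustrative helper, not part of LangChain or Runhouse.

```python
def inference_fn(pipeline, prompt, stop=None):
    # Drop the echoed prompt and keep only the newly generated text.
    return pipeline(prompt)[0]["generated_text"][len(prompt):]


def fake_pipeline(prompt):
    """Hypothetical stand-in for a transformers text-generation pipeline."""
    return [{"generated_text": prompt + " Berlin.\nQuestion: next"}]


def truncate_at_stop(text, stop=None):
    """Hypothetical helper: cut the completion at the earliest stop sequence."""
    if not stop:
        return text
    cut = len(text)
    for token in stop:
        index = text.find(token)
        if index != -1:
            cut = min(cut, index)
    return text[:cut]


question_text = "What is the capital of Germany?"
completion = inference_fn(fake_pipeline, question_text)
trimmed = truncate_at_stop(completion, stop=["\n"])
print(repr(completion))
print(repr(trimmed))
```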

In [None]:
llm = SelfHostedHuggingFaceLLM(
    model_load_fn=load_pipeline, hardware=gpu, inference_fn=inference_fn
)

In [None]:
llm("Who is the current US president?")

You can send your pipeline directly over the wire to your model, but this only works for small models (<2 GB) and is fairly slow:

In [None]:
pipeline = load_pipeline()
llm = SelfHostedPipeline.from_pipeline(
    pipeline=pipeline, hardware=gpu, model_reqs=["pip:./", "transformers", "torch"]
)
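Before sending an object over the wire, it can help to estimate how large it is once serialized. The following is a rough sketch that assumes the transfer cost is comparable to the pickled size; `serialized_size`, `small_enough_to_send`, and `SIZE_LIMIT_BYTES` are hypothetical names, and the small dictionary stands in for a real pipeline.

```python
import pickle

# Rule-of-thumb limit from the text above, expressed in bytes (assumed to mean 2 GiB).
SIZE_LIMIT_BYTES = 2 * 1024**3


def serialized_size(obj) -> int:
    """Return the number of bytes pickle needs to serialize obj."""
    return len(pickle.dumps(obj))


def small_enough_to_send(obj, limit: int = SIZE_LIMIT_BYTES) -> bool:
    """Hypothetical check: is the pickled object below the size limit?"""
    return serialized_size(obj) < limit


# Stand-in payload; a real pipeline would be far larger.
stand_in = {"weights": [0.0] * 1000}
print(serialized_size(stand_in), small_enough_to_send(stand_in))
```

Note that pickling a large model just to measure it temporarily holds a second copy in memory, so this check suits small models best.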

Alternatively, we can send it to the hardware's filesystem, which is much faster.

In [None]:
import pickle

rh.blob(pickle.dumps(pipeline), path="models/pipeline.pkl").save().to(
    gpu, path="models"
)

llm = SelfHostedPipeline.from_pipeline(pipeline="models/pipeline.pkl", hardware=gpu)
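The file-based approach relies on the pickled bytes surviving a write and a later read unchanged. A minimal local sketch of that round trip follows, using a temporary directory in place of the cluster's filesystem and a plain dictionary as a hypothetical stand-in for the pipeline; `save_object` and `load_object` are illustrative helpers, not Runhouse APIs.

```python
import pickle
import tempfile
from pathlib import Path


def save_object(obj, path: Path) -> None:
    """Hypothetical helper: pickle obj to path, creating parent folders."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(pickle.dumps(obj))


def load_object(path: Path):
    """Hypothetical helper: read path and unpickle its contents."""
    return pickle.loads(path.read_bytes())


# Stand-in for a pipeline: any picklable object behaves the same way here.
stand_in_pipeline = {"task": "text-generation", "max_new_tokens": 10}

with tempfile.TemporaryDirectory() as tmp:
    model_path = Path(tmp) / "models" / "pipeline.pkl"
    save_object(stand_in_pipeline, model_path)
    restored = load_object(model_path)

print(restored)
```

Only unpickle files you created yourself, since loading a pickle can run arbitrary code.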