
Find ways for sustainable operation of OA inference (call for brainstorming) #2806

Closed

andreaskoepf opened this issue Apr 21, 2023 · 16 comments

Comments

@andreaskoepf
Collaborator

andreaskoepf commented Apr 21, 2023

In order to provide inference online at open-assistant.io/chat/ for a longer period of time, we need to find a sustainable solution that covers the high cost of operation. This issue is a call to gather ideas and opinions on how OA could tackle this challenge.

Background: For launch we have (up to now) had access, through our extremely generous and supportive sponsors StabilityAI & Huggingface, to a large number of A100 80 GB GPUs on which our inference runs (brrrrr). Each of these GPUs can currently serve only two requests at a time with a float16 30B model, i.e. while a user sees text streamed to her/him, on the other side an A100 GPU is 50% occupied. We haven't calculated the exact cost per chat request yet, but it is obvious that renting the necessary GPUs from popular GPU cloud providers would cost a lot (e.g. Lambda Labs offers an 8x A100 80GB pod for 12 USD/h; availability of these pods is another issue).
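As a back-of-the-envelope illustration of what this means per request (the 30-second average response time is an assumption for illustration, not a measured figure):

```python
# Rough cost per chat request from the numbers above.
# Assumption (illustrative only): an average response occupies a
# request slot for ~30 s; each A100 serves 2 concurrent requests.
pod_usd_per_hour = 12.0                  # Lambda Labs 8x A100 80GB pod
gpus_per_pod = 8
slots_per_gpu = 2                        # concurrent requests per GPU
avg_response_seconds = 30                # assumed, not measured

gpu_usd_per_hour = pod_usd_per_hour / gpus_per_pod           # 1.50 USD
requests_per_gpu_hour = slots_per_gpu * 3600 / avg_response_seconds
print(gpu_usd_per_hour / requests_per_gpu_hour)              # ~0.006 USD/request
```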

Here are three initial ideas that Yannic and I discussed privately:

  1. Build a network of compute sponsors with partners like universities, businesses and wealthy private individuals who either have compute on-site or who are willing to rent a cloud GPU and run our docker image for some time.
  2. Work together with commercial entities to build a refinancing system on top of OA.
  3. Create a blockchain/token based solution (ETH or something like truthgpt's $TRUTH).

If you have further ideas for sustainable operation or cost reduction, or if you could help organize a partner network or token-based solution, please comment here or contact us on the OA discord server.

@yk
Collaborator

yk commented Apr 21, 2023

Note: help with more efficient inference is welcome, but this issue is about sustainability.

@andrewm4894
Collaborator

Make it easy for individuals to make one-off donations within the app, similar to Wikipedia. Obviously that's not going to be nearly enough to cover costs, but it could help and would at least be a good norm to nudge users towards.

@andrewm4894
Collaborator

Also, dare I say it, some ads in the app, done in a nice way similar to how readthedocs does it:

https://docs.readthedocs.io/en/stable/advertising/ethical-advertising.html

@dominiquegarmier
Contributor

dominiquegarmier commented Apr 21, 2023

I saw this yesterday: https://github.com/zilliztech/GPTCache. Perhaps we could use a smaller model to "personalize" the response after we get a cache hit (I'm not sure if GPTCache already does that).
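For illustration, this is what the semantic-caching idea looks like in general; a hand-rolled sketch of the technique, not GPTCache's actual API, and the embedding model and similarity threshold are placeholder choices:

```python
# Hand-rolled semantic cache sketch (not GPTCache's API): embed each
# prompt and reuse a cached answer when a previous prompt is similar.
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder model
cache = []                                          # (embedding, answer) pairs
SIM_THRESHOLD = 0.9                                 # placeholder threshold

def answer(prompt, run_big_model):
    q = embedder.encode(prompt, normalize_embeddings=True)
    for emb, cached in cache:
        if float(np.dot(q, emb)) >= SIM_THRESHOLD:  # cosine similarity
            return cached                           # cache hit: no GPU call
    result = run_big_model(prompt)                  # cache miss: expensive path
    cache.append((q, result))
    return result
```

A smaller model could then rewrite the cached answer to fit the new prompt, which would be the "personalize" step suggested above.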

@Shaykuu

Shaykuu commented Apr 25, 2023

Why not use the same business model as OpenAI? As a not-very-technical developer, I would be happy to pay per usage. It may be costly to implement, but if there's demand for it, it may be profitable in the long run. Here are the steps I'd take:
1 - evaluate the fixed and variable costs involved in implementing this plan
2 - estimate the demand for token-based API fees
3 - model the financials of building the solution, charging variable fees + margin (the margin first recoups fixed costs, then funds more features); a toy model follows this list
4 - include a sensitivity matrix to reflect the uncertainty in demand
5 - knock on the doors of banks and investors to make it happen
6 - develop the token-based API fees feature
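A toy version of steps 1-4 as code; every figure below is a made-up placeholder, not an estimate:

```python
# Toy break-even model for a pay-per-token API (steps 1-4 above).
# All numbers are placeholder assumptions.
fixed_costs_usd = 200_000          # step 1: fixed costs (assumed)
cost_per_1k_tokens = 0.002         # step 1: variable serving cost (assumed)
price_per_1k_tokens = 0.003        # step 3: cost + margin (assumed)
margin = price_per_1k_tokens - cost_per_1k_tokens

# Step 4: sensitivity of break-even time to demand (step 2).
for monthly_1k_tokens in (10e6, 50e6, 100e6):
    months = fixed_costs_usd / (margin * monthly_1k_tokens)
    print(f"{monthly_1k_tokens:.0e} 1k-token units/month -> break even in {months:.0f} months")
```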

Is there someone on the team able to execute this, or might you have executive support for it?

If you want my help with this, contact me at david@transparent.services.

@bitplane
Collaborator

bitplane commented Apr 25, 2023

I'd be against charging directly or using crypto-tokens (at least initially) since they have the stench of pyramid schemes and rug-pulls. I think it would be better for people to contribute in ways that serve the community; we just need to make that easy and verifiable:

  1. Multiple, smaller models. So more people can donate GPU time and choose less expensive models.
  2. Make contributing GPU time very easy: "Push this button to deploy to $hosting_provider for $time_period for $price, or this one to run it locally". This adds a node to the cluster, verifies it's connected, counts the tokens it's generating (at what speed, for which model) and generates credits for the user. Let users see their nodes and how much they're generating. (If cryptocurrency is involved here, make it platform-to-platform rather than user-to-platform; integrate other platforms as inference sources.)
  3. GPU contributors get queue-barging rights, but only for $x% of their contribution. So they skip the queue if they have credits on the system; otherwise they wait in line for a model like everyone else. The token surplus is shared as a dividend between (proven, active, non-bot) users/contributors. (A rough sketch of this credit scheme follows this comment.)
  4. Run out of credits? You're in the slow lane with everyone else. Choose a smaller model, do some tasks, watch some adverts, invite some people who generate credits and so on - gamify the credit generation process to increase exposure, compute and contribution.

If crypto-tokens need to be the way forward, use something that's directly pegged to the dollar like DAI, rather than something that invites speculative bubbles and the endless reaming butthurt that is associated with these things.
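For items 2-4, the credit accounting could be as simple as the following sketch (the names, rates and the $x% cap are all hypothetical):

```python
# Hypothetical credit scheme: contributors earn credits per token served
# and may spend a capped share of them on queue priority.
import heapq, itertools
from dataclasses import dataclass

PRIORITY_SHARE = 0.5          # "$x%" of contribution usable for barging

@dataclass
class Contributor:
    name: str
    credits_earned: float = 0.0
    credits_spent: float = 0.0

    def record_tokens(self, tokens, rate=0.001):   # credits per token (assumed)
        self.credits_earned += tokens * rate

    def can_barge(self, cost):
        return self.credits_earned * PRIORITY_SHARE - self.credits_spent >= cost

_seq = itertools.count()
queue = []   # heap of (lane, seq, user_name, prompt)

def enqueue(user, prompt, barge_cost=1.0):
    if user.can_barge(barge_cost):
        user.credits_spent += barge_cost
        heapq.heappush(queue, (0, next(_seq), user.name, prompt))  # fast lane
    else:
        heapq.heappush(queue, (1, next(_seq), user.name, prompt))  # slow lane
```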

@GuilleHoardings
Contributor

GuilleHoardings commented Apr 25, 2023

In my opinion, there are two different things: the project that generates the software, the data and the models, and a hosted chat service. The first one can progress with only passionate collaborators, but the second one needs a company or a foundation that handles the economics, unless it is based on some kind of peer-to-peer network, which could only provide a best-effort service.

I was not expecting Open Assistant to be a hosted chat service, but rather the basis for self-hosting, or for giving others the possibility to create hosted services with the software and models provided by Open Assistant. Getting into the serving business is a daunting task that requires a business structure and an operations team.

I think that going into the serving field is not sustainable (unless the company/foundation route is desired): if the service works well, many more people will be using it, and the problem will become bigger.

So, I think that a route forward is not to market Open Assistant as a ChatGPT service alternative, but as an open source alternative to the code and the models. A demo service could be provided, but users could be educated about the nature of the service so that they understand its limitations and don't have unrealistic expectations.

In addition, the hosted chat service is already taking precious time from collaborators who could work on other parts of the system. My impression was that bringing models to consumer hardware was a more important goal of the Open Assistant project than operating a chat service.

This is just my opinion. I'm not opposed to any path forward, and I'm amazed and very grateful for the great achievements of the collaborators that I see in this project.

@tyeestudio

tyeestudio commented Apr 27, 2023

Efficiency ⊆ sustainability ⊆ efficiency, i.e. the two go hand in hand. Here is an idea for the design/architecture:
OA could invest some effort (1) in creating a new (very simple, if possible) layer in front of inference that acts as an efficiency engine. The Sub-LInear Deep Learning Engine (SLIDE, an algorithm introduced by Chen et al., 2019) showed that a CPU can be more efficient than a GPU by using hash tables at the core of its design/implementation. The bottleneck for SLIDE is memory usage, which may need another effort (2) from OA: finding a way/algorithm (very simple, if possible) to cut the model into a chain of small models, run inference on each small model, and then chain the results back.
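A toy illustration of the hashing idea at SLIDE's core: locality-sensitive hashing (here a single SimHash table; real SLIDE maintains several tables and rehashes during training) selects a small candidate set of neurons so that only their activations are computed:

```python
# SimHash-based neuron selection, the core trick behind SLIDE:
# inputs land in the same bucket as weight vectors they correlate with,
# so only those neurons need to be evaluated.
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(0)
D, N, K = 128, 4096, 12                  # input dim, neurons, hash bits
W = rng.normal(size=(N, D))              # layer weight matrix
planes = rng.normal(size=(K, D))         # random hyperplanes for SimHash

def simhash(v):
    bits = (planes @ v) > 0
    return sum(1 << i for i, b in enumerate(bits) if b)

table = defaultdict(list)                # bucket -> neuron ids
for n in range(N):
    table[simhash(W[n])].append(n)

def sparse_forward(x):
    active = table[simhash(x)]           # candidate neurons only
    out = np.zeros(N)
    if active:
        out[active] = W[active] @ x      # skip the other ~N activations
    return out
```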

@andreaskoepf
Collaborator Author

Thanks everyone for your valuable input.

I propose to focus on human teaching (feedback collection) & building a highly efficient model factory instead of a scalable inference chat-hosting service. If someone wants to access our models via API, they should simply do so via the HuggingFace inference system (or potentially offerings by other providers).

@eldila

eldila commented Apr 28, 2023

I agree with GuilleHoardings. I think the focus should be on the software, data, and models. Not on hosting.

The biggest value of OA is letting small businesses, individuals, and institutions keep control of their technology stack. You can do this without hosting.

Another option would be to allow people to register Compute Agents that run on local hardware. A user could register an agent within their OA account, giving them access to compute. You could introduce a community tax which would let other people use an agent when it is idle (capped at some reasonable threshold of GPU cycles); a rough sketch follows below.

In all honesty though, this second option sounds more like a business than an open source project.
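For concreteness, the agent registration plus community tax could look roughly like this (all names and numbers are hypothetical):

```python
# Hypothetical compute-agent registry with a "community tax": a capped
# share of each registered agent's idle GPU time is shared with others.
from dataclasses import dataclass

COMMUNITY_TAX = 0.10            # share of idle cycles shared (assumed)
MAX_SHARED_GPU_HOURS = 20.0     # monthly cap per agent (assumed)

@dataclass
class ComputeAgent:
    owner: str
    endpoint: str               # e.g. URL of a local inference server
    idle_gpu_hours: float = 0.0
    shared_gpu_hours: float = 0.0

    def shareable_hours(self):
        cap_left = MAX_SHARED_GPU_HOURS - self.shared_gpu_hours
        return max(0.0, min(self.idle_gpu_hours * COMMUNITY_TAX, cap_left))

agent = ComputeAgent(owner="alice", endpoint="http://localhost:8080")
agent.idle_gpu_hours = 100.0
print(agent.shareable_hours())  # 10.0 under these assumptions
```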

@andrewm4894
Collaborator

andrewm4894 commented Apr 28, 2023

> I propose to focus on human teaching (feedback collection) & building a highly efficient model factory instead of a scalable inference chat-hosting service.

Do you not sort of need the "chat-hosting service" to be able to really scale the "human teaching (feedback collection)" stuff - or do you mean lower volume but higher quality, or more focused/targeted?

I could imagine the really useful data ending up being the annotated chats themselves - that's what gets the flywheel going to continually improve the models. If someone, e.g. Huggingface, puts their own frontend on the OA models, then the chats and thumbs data that could keep improving the models end up with them. So I'm wondering whether some form of self-hosting is still very useful in terms of data collection.

E.g. for a long time I actually used and fed https://movielens.org/ for my recommendations, explicitly because my thumbs data was going into open research and making their datasets and models better. I sort of feel the same way about using https://open-assistant.io/chat

@andreaskoepf
Collaborator Author

andreaskoepf commented Apr 30, 2023

> Do you not sort of need the "chat-hosting service" to be able to really scale the "human teaching (feedback collection)" stuff - or do you mean lower volume but higher quality, or more focused/targeted?

An inference system is needed, but we could focus more on "collective conversations", e.g. collecting feedback on responses generated for prompts submitted by other users. Third-party compute providers (like StabilityAI or Huggingface) could host the helpful consumer-facing assistant. We could also add an "oa-credits" system that would allow using the assistant or our API; credits could be earned for feedback. Inference for a teaching system would probably cost < 150k USD per year, e.g. like running a single 8-GPU pod, while operating large-scale general-purpose inference would cost several million.
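For scale, the single-pod figure is consistent with the rental price quoted at the top of this issue:

```python
# Sanity check: a single 8x A100 pod at the Lambda Labs price above.
pod_usd_per_hour = 12
print(pod_usd_per_hour * 24 * 365)   # 105_120 USD/year, i.e. < 150k USD
```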

The mission of OA should be to provide the best open-source models and human feedback data, and maybe a general-purpose, quickly deployable inference system (without mandatory human feedback collection) that could be used by others to host the OA models.

@andreaskoepf
Collaborator Author

Thanks everyone for your input.

@zenchantlive

I highly suggest a free base model that has restrictions, and then a minimal monthly fee for "unlimited" use like MJ does; in other words, high use allotments that can be slowed to conserve GPU hours.

I also, for one, would be absolutely down to pay a monthly fee for special features, like maybe some plugins, such as bot traders, that cost to use? Just an idea. Love you all dearly for your hard work. Once I get paid I'll be sending some skrill your way 😅💙

@flowpoint

I think we could maybe expose the cost to the user and mostly pass the payment through to the cloud providers.
Then we could take an affiliate fee for innovating.
This way the monopoly problem of ChatGPT and the like would be alleviated.
So basically (a rough code sketch follows the list):

button: chat with openassistant (5 free inferences)

button: "launch openassistant on aws (affiliate, 0.10 $ to openassistant) total: 2.00 /hour"

button: "launch openassistant on huggingface (affiliate, 0.10 $ to openassistant) total: 2.00 /hour"

button: "use sponsored api, free, stability.ai"

button: "use volunteer instance, free"

button: "selfhost, always free"

@zenchantlive

> I'd be against charging directly or using crypto-tokens (at least initially) since they have the stench of pyramid schemes and rug-pulls. I think it would be better for people to contribute in ways that serve the community; we just need to make that easy and verifiable:
>
>   1. Multiple, smaller models. So more people can donate GPU time and choose less expensive models.
>   2. Make contributing GPU time very easy: "Push this button to deploy to $hosting_provider for $time_period for $price, or this one to run it locally". This adds a node to the cluster, verifies it's connected, counts the tokens it's generating (at what speed, for which model) and generates credits for the user. Let users see their nodes and how much they're generating. (If cryptocurrency is involved here, make it platform-to-platform rather than user-to-platform; integrate other platforms as inference sources.)
>   3. GPU contributors get queue-barging rights, but only for $x% of their contribution. So they skip the queue if they have credits on the system; otherwise they wait in line for a model like everyone else. The token surplus is shared as a dividend between (proven, active, non-bot) users/contributors.
>   4. Run out of credits? You're in the slow lane with everyone else. Choose a smaller model, do some tasks, watch some adverts, invite some people who generate credits and so on - gamify the credit generation process to increase exposure, compute and contribution.
>
> If crypto-tokens need to be the way forward, use something that's directly pegged to the dollar like DAI, rather than something that invites speculative bubbles and the endless reaming butthurt that is associated with these things.

I mean, jeez, this is perfect. All of these should be implemented.
