# Bridging the gap in enterprise LLM usage

(Written Oct 2023)

### Contents
1.	The LLM opportunity for enterprises
2.	The gap preventing LLM adoption at enterprises
3.	Why Finetuning is (mostly) a bad idea
4.	A plan to own LLM implementation
5.	How a consultant could build a better model than OpenAI



### The LLM opportunity for enterprises
Companies will heavily adopt language model strategies in the next decade:
1.	LLMs can already perform basic text analysis and generation tasks like marketing writing, email drafting or summarization better and cheaper than humans. There's a big cost opportunity.
2.	LLMs can also already perform tasks that humans cannot practically do, like analyzing the connections in large text datasets. There will be a big opportunity to generate superior work.
3.	Models continue to get better and cheaper – the business case to use them will improve every year.



### The gap preventing LLM adoption at enterprises
That said, there is a big gap between the models and business use cases:
1.	Models and model tooling are developer oriented.
2.	The outputs of models are poorly understood, and carry significant security and privacy risks.
3.	Models can convincingly lie and often invent facts in a way that is difficult to falsify.
4.	The best models are exposed through external APIs, where downstream tooling and analysis are a black box. The companies that run the best models are highly secretive about what exactly happens to the input and how precisely data is used. These companies also have ideological missions around AGI that create unknown risks for consumers of their products in terms of availability and data security. (May 2024: since writing, these companies have become more corporate, but this is still a risk!)
5.	There are unclear IP and legal issues around current models that could lead to their withdrawal from the market or significant price increases.
6.	There is a high level of complexity in getting the best out of the models – strategies like finetuning, retrieval, internal attention, system prompts all must be created, tested and implemented. This is exacerbated by the fact that the tooling on the other side of the main model APIs seems to frequently change.
7.	The ecosystem is rapidly changing, with new innovations occurring on a regular basis that few companies can keep up with.




### Why Finetuning is (mostly) a bad idea
Moreover, due to the gold rush in AI, dangerous advice is commonly given. For example, one of the main strategies recommended/offered to companies (including by McKinsey, BCG, OpenAI, Anthropic, Cohere and Accenture) is finetuning models on their company data. 
Finetuning makes sense at a high level – proprietary data is one of the most valuable assets companies have to improve language models and prominent examples like BloombergGPT have shown the potential of proprietary data in model training.
However, finetuning is a dangerous idea for a number of important reasons:
1.	Without a lot of expertise, finetuning can easily destroy capabilities and make the model worse. To give one demonstration of why this can occur, consider finetuning a model on 10-K annual reports to produce a business-oriented model. One of the most common completions of “the ____” in such data is “the company”. This completion occurs with far higher frequency than in the pre-trained dataset but is not something you actually want the model to learn. Another issue is that finetuning the model to complete factual statements that it does not know the answer to (e.g. something that has occurred after its pretraining data was collected) does not teach the model new knowledge, but rather encourages the model to lie or guess in response to such questions. 
2.	Models are compressors of their training data. A model trained on sensitive company data becomes a single point of failure for that data from a cybersecurity point of view. If the model weights leak, your company’s sensitive data can then be easily extracted from the model.
3.	Building on 2, a finetuned model is an access control nightmare. You want to use your best and most valuable data in the model, but then that data can be extracted by users across your organization – you wouldn’t let a random junior team member read all your CEO’s emails, but that’s effectively what is enabled by a widely accessible model that has been broadly finetuned. And beyond accidental exposure, corporate espionage becomes far easier in this context – the spy can just extract sensitive data by using the model.
4.	Language models learn static data distributions and do not have a sense of time. See for example the knowledge cutoffs that have been embedded into ChatGPT. This means that your most recent data, or data that is changing over time, cannot be effectively incorporated by the model. Consider a scenario where your CEO changes, or two large competitors merge – most of the model’s finetuning data reflects the world before the change, and finetuning gives the model no way to tell older data from newer data.
5.	Finetuning is much more complex from an infrastructure point of view than inference, particularly for a cutting-edge model. E.g. GPT-3.5 inference is very easy – just call the API – but finetuning requires extensive data preparation, evaluation, hyperparameter management etc. 
Companies probably should finetune, but they must be very careful about it! Most companies just don’t have the capability to finetune well. 
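
The “the company” effect from point 1 can be seen in a toy example. The sketch below counts which word follows “the” in two tiny invented corpora, one standing in for general pretraining text and one for 10-K filings. The function name `next_word_distribution` and both corpora are illustrative assumptions; a real analysis would use the model's tokenizer and far more text.

```python
from collections import Counter


def next_word_distribution(text: str, context_word: str) -> dict[str, float]:
    """Relative frequency of each word that immediately follows `context_word`."""
    tokens = text.lower().split()
    followers = Counter(
        nxt for cur, nxt in zip(tokens, tokens[1:]) if cur == context_word
    )
    total = sum(followers.values())
    if total == 0:
        return {}
    return {word: count / total for word, count in followers.items()}


# Invented stand-ins for a general corpus and a 10-K finetuning corpus.
general_corpus = (
    "the cat sat on the mat while the dog watched "
    "the company of birds by the river"
)
filings_corpus = (
    "the company reported that the company expects "
    "the board to review the company results"
)

general = next_word_distribution(general_corpus, "the")
filings = next_word_distribution(filings_corpus, "the")

# Share of "the ____" completions that are "company" in each corpus.
print(f"general: {general['company']:.2f}  filings: {filings['company']:.2f}")
```

Finetuning on the second corpus pushes the model toward that skewed distribution everywhere, not only in business contexts, which is one way capability loss can creep in.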

Finetuning is far from the only place where there is a big expertise gap for enterprise usage.




### A plan to own LLM implementation
There’s a big opportunity here to be the vanguard in helping companies develop and run highly capable LLM systems. Azure OpenAI Service and AWS Bedrock offer managed solutions with a slightly different focus, but each has a low touch sales process and requires a lot of expertise on the customer’s side to be used effectively. We can not only help customers use these services, but also help them run their own highly capable models if they want maximum control, privacy and flexibility.

This roadmap defines the key areas where this enterprise gap can be filled. It can easily be accomplished in less than a year with a small amount of investment and commitment. 
1.	Partner with a foundation model company who are focused on open source (as they will allow the weights to be directly customized and given to end customers). With this partner, build a series of high-quality models sliced into use cases (e.g. marketing, strategy, legal etc.) and by industry vertical. These models should be sized to allow single GPU inference (only data parallelism needed, no tensor/expert/pipeline complexity). The point is to give customers fixed models, where they control and understand every stage of the inference process, but that are very simple to run and scale across the organization. 
2.	Build a set of large datasets that can be used in the above finetuning, as well as to improve third party models (e.g. as part of finetuning on AWS Bedrock with a provider like Anthropic). Sources can be SEC filings, transcripts, cases, expert calls etc.
3.	Similarly, build a structured RL dataset that can be centrally built and sold to clients. There is no point in every customer building their own RL data except in very specific scenarios.
4.	Build a 3-stage pipeline for companies to incorporate sensitive and temporal data into their models. Help them understand what data to include at each stage. 
Critically, this multi-stage pipeline allows companies to build models that know proprietary data about them, can be quickly updated every week with their latest data, have no capability loss vs the base model, can be cheaply run in their own cloud environment with limited inference complexity, pose no security or access control risk, and are fully under the control of the company.
5.	Develop a standardized software package for effective retrieval/internal attention/context summarization/temporal context.
6.	Build great new evals and LLM testing for business. All current model evals are a mess and speak to general reasoning/coding. Help companies understand the quality of the systems they use for the problems they face, not on abstract reasoning and coding benchmarks. Consultants have unique capabilities to build these enterprise oriented datasets, and even evals for specific companies.
7.	Build great query routing to let companies move between different models for different use cases. This should include a better set of embeddings that is finetuned for each use case – current embedding systems are bad and poorly suited for enterprise use cases.
8.	Develop great explainability, sourcing, reasoning and safety systems. On our own models, we would have direct access to the logit level (and caching capabilities).
9.	Be prepared to build application software on top of LLM systems in response to customer demand. We’ll be uniquely positioned to see how different parts of an organization want to use LLMs, and we can respond to that by building systems that address those use cases directly.
10.	Have a high touch sales and customer success program to help customers understand our tools and how to use them. 
All of the above items should be modular – customers should be able to easily plug in any of our solutions with whatever model, retrieval etc. that they want. Our goal is to dominate the implementation of LLM systems at big companies, not to own every part of the stack. Right now there are really good foundation models behind external APIs, but everything else in the ecosystem sucks and expertise is very limited. That’s the opportunity for us.
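
As a rough illustration of the query routing in item 7, the sketch below sends each query to the model whose use-case description it most resembles. Everything here is an assumption for illustration: the route table, the model names, the threshold, and especially the bag-of-words `embed` function, which is a stand-in for the use-case-specific learned embeddings the roadmap calls for.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding'; a real system would use a learned model."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[term] * b[term] for term in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


# Hypothetical route table: use case -> (matching description, model name).
ROUTES = {
    "legal": ("contract clause liability compliance regulation", "legal-model"),
    "marketing": ("campaign brand copy email audience", "marketing-model"),
    "strategy": ("market entry competitor pricing growth", "strategy-model"),
}


def route(query: str, default: str = "general-model", threshold: float = 0.1) -> str:
    """Return the best-matching model, or `default` if nothing clears the threshold."""
    query_vec = embed(query)
    best_model, best_score = default, threshold
    for description, model in ROUTES.values():
        score = cosine(query_vec, embed(description))
        if score > best_score:
            best_model, best_score = model, score
    return best_model


print(route("review the liability clause in this contract"))
print(route("what is the weather today"))
```

Because the router only depends on `embed` and the route table, either can be swapped out without touching the models behind it, which fits the modularity goal above.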



### How a consultant could build a better model than OpenAI
There is massive untapped potential in LLMs for large consultants. McKinsey, for example, is using what appears to be a simple retrieval-augmented system called [Lilli](https://venturebeat.com/ai/consulting-giant-mckinsey-unveils-its-own-generative-ai-tool-for-employees-lilli/) to access its internal data and match experts and prior cases to new projects. However, only using retrieval on existing data misses the potential power of a consultant’s internal dataset.

Let’s take a look at Accenture as one example. My contention would be that with $5mln and 12 months, we could build a business-oriented language model for Accenture that OpenAI could not match in quality. The same applies to any large consultant.

This model could be used internally at Accenture to increase worker efficiency without any worries about data privacy (as Accenture would deploy and control the model). Additionally, weaker versions of this model could be released to clients in line with the strategy described in the previous section. They would be weaker because, for privacy reasons, an external model could only be trained on a subset of the extracted training data.
Now this is a big claim – OpenAI has the top researchers, engineers and software specialists in the field, spent $200mln on training GPT-4, and has a massive RL dataset that it has worked on for years and augmented through the 100s of millions of people using ChatGPT. So how could we possibly believe that we could beat that?

The core idea lies in a twist on the InstructGPT paper, where OpenAI built a model that outperformed far larger and more expensive models (models 100x bigger). They did this using Reinforcement Learning from Human Feedback or “RLHF”. GPT-4 is extensively trained using RLHF, but Accenture can build an RL dataset for business purposes that OpenAI cannot match. Consultants in general have probably built the most useful datasets of any company type for model improvement.

Where does this magical RL dataset come from? Accenture has been one of the biggest strategy consultants and one of the biggest technology consultants for decades. Over the years, Accenture has solved more business and technology problems than any other company. And certainly far more than the top LLM companies/distributors like OpenAI, Cohere, Anthropic, Microsoft, Amazon etc. A lot of these problems and solutions are stored in Accenture’s data at various levels of abstraction. We can extract and process this data into a problem-solution dataset that is free of overly sensitive client information and then use it to train a reward model, like in InstructGPT, that is uniquely tuned to complex real world business and technology problems. This reward model can then be used to perform RL on a pretrained model.

(Technical Note: the above two paragraphs also apply to non-RL preference methods like DPO or other RL based methods like Constitutional AI. The advantage of Accenture is the private dataset they can build, not the finetuning method used. As a matter of fact, RLHF cannot be used directly, as we don’t have direct preferences in the Accenture data and must use extracted or synthetic preferences, which is more akin to Constitutional AI.)
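
Whichever route supplies the preference pairs, the reward model objective in InstructGPT is a pairwise ranking loss: the reward for the preferred answer should exceed the reward for the rejected one. The sketch below computes that loss for a single pair. The function name and the example scores are illustrative assumptions; in practice the two rewards would come from a trained network and the loss would be averaged over a batch.

```python
import math


def pairwise_reward_loss(reward_preferred: float, reward_rejected: float) -> float:
    """Compute -log(sigmoid(r_preferred - r_rejected)) for one preference pair."""
    margin = reward_preferred - reward_rejected
    # -log(sigmoid(m)) == log(1 + exp(-m)); branch to avoid overflow in exp().
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Placeholder reward scores for a hypothetical (problem, solution) pair.
agrees_with_preference = pairwise_reward_loss(2.0, 0.0)
contradicts_preference = pairwise_reward_loss(0.0, 2.0)
print(agrees_with_preference < contradicts_preference)
```

The loss is small when the reward model already ranks the preferred solution higher and grows roughly linearly with the margin when it ranks the pair the wrong way round.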

Less importantly but still of high potential value, Accenture has probably 100,000 people on staff today who each have some specific business or technology knowledge that isn’t written anywhere on the internet. Not in any dataset that OpenAI could access. Again, you could design a process and software to extract this vast latent expertise at Accenture into a suitable RL dataset to augment what can be obtained from Accenture’s existing data. 

Another example source of RL data would be case style data used in interviews or in business school teaching – you could train a reward model on these cases and use it for model evaluation or further RL training. This data already exists – it just needs to be found, collected and processed into the right format.

Accenture could also drive an RL flywheel by deploying an internal model to its consultants and experts to assist them in their work. It can gather “online” preference data during this deployment by providing multiple answers to queries and asking the user to rank the responses.
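
A minimal sketch of how one ranked set of answers from this flywheel could be turned into training data, assuming the user orders the responses from best to worst. The `PreferencePair` type and `ranking_to_pairs` helper are hypothetical names; the idea of expanding a ranking of K responses into all K-choose-2 comparisons follows the InstructGPT setup.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class PreferencePair:
    prompt: str
    chosen: str
    rejected: str


def ranking_to_pairs(prompt: str, ranked_responses: list[str]) -> list[PreferencePair]:
    """Expand a best-to-worst ranking into every (chosen, rejected) pair."""
    # combinations() preserves input order, so the first element of each
    # tuple is always the higher-ranked response.
    return [
        PreferencePair(prompt, better, worse)
        for better, worse in combinations(ranked_responses, 2)
    ]


# Invented example: a consultant ranks three candidate answers.
pairs = ranking_to_pairs(
    "How should we sequence the ERP migration?",
    ["answer A", "answer B", "answer C"],
)
for pair in pairs:
    print(pair.chosen, ">", pair.rejected)
```

Pairs in this form can feed a reward model or a direct preference method such as DPO.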

On top of this, there’s a lot of paid business text data (e.g. expert calls or consultant research) that is often bought by hedge funds/private equity for investment purposes but that hasn’t been incorporated by model training companies yet. This is less important than the RL data as it can only be used for pretraining, but we could at least augment the best open source models using this data. I’m not yet convinced that specializing the model into a business regime, as BloombergGPT did, is the right approach – something to explore – but there are definitely some very high quality tokens that aren’t being used in pretraining right now.

