In [1]:
!pip install transformers
!pip install einops
!pip install xformers



In [2]:
import torch
import transformers

# Running Our Own Large Language Model

We'll be running a model called MPT-7B, specifically the version of it fine-tuned for chat.
MPT is the name MosaicML gave its models; it stands for "MosaicML Pretrained Transformer", and 7B refers to the number of parameters, roughly seven billion.  

You can view the model's Hugging Face page here: https://huggingface.co/mosaicml/mpt-7b-chat  
Hugging Face is currently the dominant platform for sharing deep learning models, and we'll be pulling a lot from it today. Most model pages include plenty of helpful information. If you're looking for a specific functionality, Hugging Face is a good place to start.

In [3]:
name = 'mosaicml/mpt-7b-chat'

config = transformers.AutoConfig.from_pretrained(name, trust_remote_code=True)
config.init_device = 'cuda:0' # For fast initialization directly on GPU!

model = transformers.AutoModelForCausalLM.from_pretrained(
  name,
  config=config,
  torch_dtype=torch.float16, # Load model weights in float16
  trust_remote_code=True
)

Instantiating an MPTForCausalLM model from /root/.cache/huggingface/modules/transformers_modules/mosaicml/mpt-7b-chat/c53dee01e05098f81cac11145f9bf45feedc5b2f/modeling_mpt.py
You are using config.init_device='cuda:0', but you can also use config.init_device="meta" with Composer + FSDP for fast initialization.


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]
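
The load call above asks for float16 weights. As a rough back-of-the-envelope check of what that means for GPU memory, the sketch below multiplies the parameter count by the bytes per parameter. It assumes a nominal 7 billion parameters (the exact count differs slightly) and counts the weights only, not activations or the generation cache. `BYTES_PER_PARAM` and `weight_memory_gib` are names made up for this illustration.

```python
# Rough estimate of the memory needed just to hold the model weights.
# Assumption: a nominal 7 billion parameters; the real count is a bit different.
BYTES_PER_PARAM = {"float32": 4, "float16": 2}

def weight_memory_gib(num_params, dtype):
    """Return the size of the weights in GiB for the given dtype."""
    return num_params * BYTES_PER_PARAM[dtype] / 2**30

NUM_PARAMS = 7_000_000_000  # nominal value, see the assumption above

for dtype in ("float32", "float16"):
    print(f"{dtype}: {weight_memory_gib(NUM_PARAMS, dtype):.1f} GiB")
```

Halving the bytes per parameter halves the footprint, which is what makes a model of this size practical on a single GPU.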

Here we also create a tokenizer.
Models can't take text directly as input; the text first needs to be converted into a sequence of tokens.
LLM usage, such as input and output length, is usually measured in tokens.
For English text, a token averages roughly 4 characters, though this varies considerably with the specific tokenizer used.
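
To make the idea of "text in, integers out" concrete, here is a toy sketch of the round trip. It is an illustration only: it splits on whitespace, whereas real tokenizers (including the one loaded in the next cell) learn subword pieces from data. `build_vocab`, `encode`, and `decode` are hypothetical helpers, not part of the transformers library.

```python
# Toy illustration only: real tokenizers use learned subword pieces,
# this sketch simply treats each whitespace-separated word as one token.

def build_vocab(corpus):
    """Assign an integer id to each distinct whitespace-separated piece."""
    vocab = {}
    for piece in corpus.split():
        if piece not in vocab:
            vocab[piece] = len(vocab)
    return vocab

def encode(text, vocab):
    """Turn text into a list of token ids (fails on unknown words)."""
    return [vocab[piece] for piece in text.split()]

def decode(ids, vocab):
    """Turn a list of token ids back into text."""
    id_to_piece = {i: piece for piece, i in vocab.items()}
    return " ".join(id_to_piece[i] for i in ids)

toy_vocab = build_vocab("here is a recipe for vegan banana bread")
toy_ids = encode("a recipe for banana bread", toy_vocab)
print(toy_ids)
print(decode(toy_ids, toy_vocab))
```

The model only ever sees the list of ids; decoding is what turns its output back into readable text.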

In [4]:
# MPT-7B was trained with the GPT-NeoX-20B tokenizer, so we load that one
tokenizer = transformers.AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")

Now we create a pipeline (`pipe`) from our model and tokenizer. A pipeline is a convenience wrapper provided by the transformers library that runs the tokenizer and the model together for us.


In [7]:
pipe = transformers.pipeline('text-generation', model=model, tokenizer=tokenizer, device='cuda:0')
with torch.autocast('cuda', dtype=torch.float16):
    output = pipe('Here is a recipe for vegan banana bread:\n',
                  max_new_tokens=100,
                  do_sample=True,
                  use_cache=True)
print(output[0]['generated_text'])

The model 'MPTForCausalLM' is not supported for text-generation. Supported models are ['BartForCausalLM', 'BertLMHeadModel', 'BertGenerationDecoder', 'BigBirdForCausalLM', 'BigBirdPegasusForCausalLM', 'BioGptForCausalLM', 'BlenderbotForCausalLM', 'BlenderbotSmallForCausalLM', 'BloomForCausalLM', 'CamembertForCausalLM', 'CodeGenForCausalLM', 'CpmAntForCausalLM', 'CTRLLMHeadModel', 'Data2VecTextForCausalLM', 'ElectraForCausalLM', 'ErnieForCausalLM', 'GitForCausalLM', 'GPT2LMHeadModel', 'GPT2LMHeadModel', 'GPTBigCodeForCausalLM', 'GPTNeoForCausalLM', 'GPTNeoXForCausalLM', 'GPTNeoXJapaneseForCausalLM', 'GPTJForCausalLM', 'LlamaForCausalLM', 'MarianForCausalLM', 'MBartForCausalLM', 'MegaForCausalLM', 'MegatronBertForCausalLM', 'MvpForCausalLM', 'OpenLlamaForCausalLM', 'OpenAIGPTLMHeadModel', 'OPTForCausalLM', 'PegasusForCausalLM', 'PLBartForCausalLM', 'ProphetNetForCausalLM', 'QDQBertLMHeadModel', 'ReformerModelWithLMHead', 'RemBertForCausalLM', 'RobertaForCausalLM', 'RobertaPreLayerNormFor

Here is a recipe for vegan banana bread:

Ingredients:

- 1 tablespoon organic coconut oil (or organic grape seed oil)
- 2 ripe bananas
- 1/2 cup organic maple syrup
- 2 tablespoons organic raw honey
- 2 teaspoons ground cinnamon
- 1/4 teaspoon salt
- 1/3 cup organic rolled oats
- 1/2 cup organic all-purpose flour
- 1/2 cup organic gluten-free flour
- 1/2 cup finely ground organic almonds
-


And there it is! Your very own unique run of a chat model, just for you.  

Chat models are generally trained to expect input in a specific format, which is normally handled by code that the user doesn't see. The further the input text is from what the model expects, the worse its output will be.  
Below I've created a prompt that is closer to what the model expects; just fill in your text between the User and Assistant lines.

In [13]:
prompt = """A conversation between a user and an LLM-based AI assistant. The assistant gives helpful and honest answers.
User
Tell me why humans can't be replaced by robots

Assistant
"""
with torch.autocast('cuda', dtype=torch.float16):
    output = pipe(prompt,
                  max_new_tokens=200,
                  do_sample=True,
                  use_cache=True)
print(output[0]['generated_text'])

A conversation between a user and an LLM-based AI assistant. The assistant gives helpful and honest answers.
User
Tell me why humans can't be replaced by robots

Assistant
Well, there are a lot of good reasons, and I would probably need a lot of information to give a comprehensive answer. But here are a few of the things that would be difficult for robots to do that humans can do: 

• Humans can create things that don't exist yet, like a work of art, a new technology or a scientific discovery. Robots can copy things, but not create new things that don't exist yet. 

• Humans can empathize with other people, and are often motivated by wanting to help others. Robots can be programmed to do helpful tasks, but they don't have human levels of empathy or motivation. 

• Robots can't make choices based on situations that haven't happened yet, or based on values that are not explicit in their programming. Humans can learn and decide based on incomplete information and values that are not expli
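
If you want to try several questions, the prompt layout used above can be wrapped in a small helper. This is a minimal sketch: `SYSTEM_LINE`, `build_prompt`, and `extract_reply` are hypothetical names, and the layout simply copies the prompt from the cell above, which is not necessarily the official MPT-7B-Chat template.

```python
# Hypothetical helpers; the layout mirrors the prompt used in this notebook
# and is not necessarily the official MPT-7B-Chat template.
SYSTEM_LINE = (
    "A conversation between a user and an LLM-based AI assistant. "
    "The assistant gives helpful and honest answers."
)

def build_prompt(user_message, system_line=SYSTEM_LINE):
    """Return the prompt text, ending right where the assistant should start."""
    return f"{system_line}\nUser\n{user_message.strip()}\n\nAssistant\n"

def extract_reply(generated_text, prompt):
    """Drop the echoed prompt, since the generated text repeats it."""
    if generated_text.startswith(prompt):
        return generated_text[len(prompt):].strip()
    return generated_text.strip()

prompt = build_prompt("Tell me why humans can't be replaced by robots")
print(prompt)
```

The resulting `prompt` can be passed to `pipe(...)` exactly as in the cell above, and `extract_reply(output[0]['generated_text'], prompt)` then leaves only the assistant's part.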

If you're interested in setting up one of these properly, follow the code provided in the MosaicML repository for the MPT-7B chat model:
https://github.com/mosaicml/llm-foundry/blob/main/scripts/inference/hf_chat.py