<p style="background-color:#fff1d7; padding:15px; "> <b>FYI: </b> Two model classes in the transformers library are relevant here: <code>AutoModelForCausalLM</code> and <code>AutoModelForMaskedLM</code>. Causal language models are the decoder-only models used for text generation. They are called causal because, when predicting the next token, the model can attend only to the tokens that precede it (those to its left). Masked language models are the encoder-only models used to build rich text representations. They are called masked because they are trained to predict masked (hidden) tokens in a sequence.</p>
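
To make the "causal" constraint concrete, here is a minimal sketch in plain Python (no PyTorch needed) of which positions each token may attend to. The helper name `causal_mask` and the word-level tokens are illustrative assumptions, not part of the transformers API; a real model applies an equivalent mask to its attention scores over subword tokens.

```python
def causal_mask(seq_len):
    """Return a seq_len x seq_len grid where mask[i][j] is True
    if position i is allowed to attend to position j (j <= i)."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]

# Illustrative word-level tokens (an assumption; real tokenizers use subwords)
tokens = ["The", "capital", "of", "India", "is"]
mask = causal_mask(len(tokens))

# Show, for each token, the tokens it can see: itself and everything to its left
for tok, row in zip(tokens, mask):
    visible = [t for t, allowed in zip(tokens, row) if allowed]
    print(f"{tok!r} attends to {visible}")
```

The last row is all `True`: the final token sees the whole prompt, which is why its output vector is the one used to predict the next token.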

In [None]:
from transformers import AutoModelForMaskedLM, AutoTokenizer, AutoModelForCausalLM, pipeline

In [None]:
# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")

model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-mini-4k-instruct",
    device_map="cpu",
    torch_dtype="auto",
    trust_remote_code=True,
)

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

In [None]:
# Create a pipeline
generator = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    return_full_text=False, # False means to not include the prompt text in the returned text
    max_new_tokens=50,
    do_sample=False, # no randomness in the generated text
)

Device set to use cpu


In [None]:
prompt = "Write an email apologizing to Rohan for the tragic gardening mishap. Explain how it happened. "

output = generator(prompt)

print(output[0]['generated_text'])

KeyboardInterrupt: 

In [None]:
model


Phi3ForCausalLM(
  (model): Phi3Model(
    (embed_tokens): Embedding(32064, 3072, padding_idx=32000)
    (embed_dropout): Dropout(p=0.0, inplace=False)
    (layers): ModuleList(
      (0-31): 32 x Phi3DecoderLayer(
        (self_attn): Phi3Attention(
          (o_proj): Linear(in_features=3072, out_features=3072, bias=False)
          (qkv_proj): Linear(in_features=3072, out_features=9216, bias=False)
          (rotary_emb): Phi3RotaryEmbedding()
        )
        (mlp): Phi3MLP(
          (gate_up_proj): Linear(in_features=3072, out_features=16384, bias=False)
          (down_proj): Linear(in_features=8192, out_features=3072, bias=False)
          (activation_fn): SiLU()
        )
        (input_layernorm): Phi3RMSNorm()
        (resid_attn_dropout): Dropout(p=0.0, inplace=False)
        (resid_mlp_dropout): Dropout(p=0.0, inplace=False)
        (post_attention_layernorm): Phi3RMSNorm()
      )
    )
    (norm): Phi3RMSNorm()
  )
  (lm_head): Linear(in_features=3072, out_features=3206

In [None]:
model.model.embed_tokens

Embedding(32064, 3072, padding_idx=32000)

In [None]:
model.model

Phi3Model(
  (embed_tokens): Embedding(32064, 3072, padding_idx=32000)
  (embed_dropout): Dropout(p=0.0, inplace=False)
  (layers): ModuleList(
    (0-31): 32 x Phi3DecoderLayer(
      (self_attn): Phi3Attention(
        (o_proj): Linear(in_features=3072, out_features=3072, bias=False)
        (qkv_proj): Linear(in_features=3072, out_features=9216, bias=False)
        (rotary_emb): Phi3RotaryEmbedding()
      )
      (mlp): Phi3MLP(
        (gate_up_proj): Linear(in_features=3072, out_features=16384, bias=False)
        (down_proj): Linear(in_features=8192, out_features=3072, bias=False)
        (activation_fn): SiLU()
      )
      (input_layernorm): Phi3RMSNorm()
      (resid_attn_dropout): Dropout(p=0.0, inplace=False)
      (resid_mlp_dropout): Dropout(p=0.0, inplace=False)
      (post_attention_layernorm): Phi3RMSNorm()
    )
  )
  (norm): Phi3RMSNorm()
)

In [None]:
model.model.layers[0]

Phi3DecoderLayer(
  (self_attn): Phi3Attention(
    (o_proj): Linear(in_features=3072, out_features=3072, bias=False)
    (qkv_proj): Linear(in_features=3072, out_features=9216, bias=False)
    (rotary_emb): Phi3RotaryEmbedding()
  )
  (mlp): Phi3MLP(
    (gate_up_proj): Linear(in_features=3072, out_features=16384, bias=False)
    (down_proj): Linear(in_features=8192, out_features=3072, bias=False)
    (activation_fn): SiLU()
  )
  (input_layernorm): Phi3RMSNorm()
  (resid_attn_dropout): Dropout(p=0.0, inplace=False)
  (resid_mlp_dropout): Dropout(p=0.0, inplace=False)
  (post_attention_layernorm): Phi3RMSNorm()
)

In [None]:
prompt = "The capital of India is"

In [None]:
# Tokenize the input prompt
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
input_ids

tensor([[ 450, 7483,  310, 7513,  338]])

In [None]:
# Get the output of the model before the lm_head
model_output = model.model(input_ids)

In [None]:
model_output

BaseModelOutputWithPast(last_hidden_state=tensor([[[-0.3184,  1.2031,  0.3203,  ..., -0.2988,  0.6719,  0.1211],
         [-0.1338,  0.3281,  0.3848,  ...,  0.5547, -0.1406, -0.6055],
         [-0.5859,  1.0859,  1.5391,  ..., -0.4141,  0.2773,  0.4062],
         [-0.6406,  0.8516,  0.3633,  ...,  0.1992, -0.0195,  0.0140],
         [-0.7578,  1.2500,  0.9727,  ...,  1.3203, -0.0503, -0.5117]]],
       dtype=torch.bfloat16, grad_fn=<MulBackward0>), past_key_values=((tensor([[[[-2.9688e-01,  1.3770e-01, -7.8125e-02,  ..., -9.1797e-02,
           -2.6123e-02,  5.8350e-02],
          [ 8.6914e-02,  9.7656e-02, -1.3672e-01,  ..., -2.3926e-02,
           -1.0840e-01, -2.7100e-02],
          [ 4.0820e-01,  4.1504e-02, -1.9141e-01,  ..., -1.5137e-01,
           -2.1094e-01,  8.7891e-02],
          [ 2.4023e-01,  1.4648e-03, -7.1289e-02,  ..., -4.6143e-02,
            6.8359e-02,  4.7852e-02],
          [ 9.8633e-02, -9.8633e-02, -7.1289e-02,  ..., -1.9336e-01,
            2.3047e-01,  2.2852e

In [None]:
# Get the shape of the model output before the lm_head
model_output[0].shape

torch.Size([1, 5, 3072])

In [None]:
# Get the output of the lm_head
lm_head_output = model.lm_head(model_output[0])

In [None]:
lm_head_output

tensor([[[25.1250, 25.2500, 23.2500,  ..., 19.3750, 19.3750, 19.3750],
         [31.0000, 31.5000, 26.0000,  ..., 25.8750, 25.8750, 25.8750],
         [31.3750, 28.7500, 31.0000,  ..., 26.1250, 26.1250, 26.1250],
         [33.5000, 34.0000, 37.5000,  ..., 28.8750, 28.8750, 28.8750],
         [33.2500, 33.0000, 32.2500,  ..., 25.3750, 25.3750, 25.3750]]],
       dtype=torch.bfloat16, grad_fn=<UnsafeViewBackward0>)

In [None]:
lm_head_output.shape

torch.Size([1, 5, 32064])

For each token in the input prompt, the LM head outputs a vector of size 32064 (the vocabulary size), so there are 5 vectors of 32064 values each. These values are raw scores (logits). Applying a softmax to one of these vectors turns it into a probability distribution over the vocabulary, giving the probability that each token comes right after the corresponding token in the input prompt.
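
As a rough illustration of how one score vector maps to a probability distribution, the sketch below applies a softmax to a made-up four-entry vector in plain Python so the arithmetic stays visible. The helper `softmax` and the values in `toy_logits` are assumptions for illustration only. On the real tensor, the equivalent call would presumably be `torch.softmax(lm_head_output[0, -1].float(), dim=-1)`.

```python
import math

def softmax(logits):
    """Convert a list of raw scores (logits) into probabilities that sum to 1."""
    m = max(logits)                            # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]   # exponentiate the shifted scores
    total = sum(exps)
    return [e / total for e in exps]           # normalize so the values sum to 1

# Made-up scores standing in for one 32064-entry vector from the LM head
toy_logits = [2.0, 1.0, 0.1, -1.0]
probs = softmax(toy_logits)

print([round(p, 3) for p in probs])
print(sum(probs))
```

Softmax preserves the ordering of the scores, so the token with the highest logit is also the token with the highest probability.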

Since we're interested in generating the token that comes after the last token in the input prompt ("is"), we focus on the last vector. In the next cell, `lm_head_output[0,-1]` is a vector of size 32064 from which you can generate that token. To do so, find the id of the token with the highest value in `lm_head_output[0,-1]` using `argmax(-1)`, where -1 means the last axis.

In [None]:
token_id = lm_head_output[0,-1].argmax(-1)
token_id

tensor(903)

In [None]:
tokenizer.decode(token_id)

'_'
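
The selection step above can be sketched on toy data: take the score vector at the last position, pick the highest-scoring id (the greedy choice), and optionally rank a few runners-up. Everything here is hypothetical: `toy_vocab`, the values in `toy_logits`, and `k` are made up for illustration and say nothing about what Phi-3 actually predicts. On the real tensor, `torch.topk(lm_head_output[0, -1], k=5)` would presumably give the corresponding values and ids.

```python
# Hypothetical 5-token vocabulary and made-up scores, shaped like
# lm_head_output: [batch=1][positions=3][vocab=5]
toy_vocab = ["alpha", "beta", "gamma", "delta", "epsilon"]
toy_logits = [[
    [0.1, 0.3, 2.0, 1.5, 0.2],
    [1.2, 0.4, 0.3, 0.9, 0.1],
    [3.1, 2.4, 0.5, 0.2, 1.0],
]]

# Equivalent of lm_head_output[0, -1]: first batch item, last position
last = toy_logits[0][-1]

# Greedy choice: the id with the highest score (what argmax returns)
greedy_id = max(range(len(last)), key=lambda i: last[i])

# Runners-up: the ids of the k highest scores, best first
k = 3
top_ids = sorted(range(len(last)), key=lambda i: last[i], reverse=True)[:k]

print(greedy_id, toy_vocab[greedy_id])
print([(i, toy_vocab[i], last[i]) for i in top_ids])
```

Looking at the runners-up as well as the single best id can help when the greedy token is unexpected.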