
In [1]:
%%capture
# Quote each requirement so the shell does not treat ">=" as an output redirection
!pip install "langchain>=0.1.17" "openai>=1.13.3" "langchain_openai>=0.1.6" "transformers>=4.40.1" "datasets>=2.18.0" "accelerate>=0.27.2" "sentence-transformers>=2.5.1" "duckduckgo-search>=5.2.2"
!CMAKE_ARGS="-DLLAMA_CUBLAS=on" pip install llama-cpp-python

In [None]:
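from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

# Load the tokenizer that matches the model checkpoint
tokenizer = AutoTokenizer.from_pretrained('microsoft/Phi-3-mini-4k-instruct')

# Load the model weights onto the GPU
model = AutoModelForCausalLM.from_pretrained(
    'microsoft/Phi-3-mini-4k-instruct',
    device_map='cuda',
    torch_dtype='auto',
    trust_remote_code=False,
)

# Wrap the model and tokenizer in a text-generation pipeline
generator = pipeline(
    'text-generation',
    model=model,
    tokenizer=tokenizer,
    return_full_text=False,  # return only the newly generated text
    max_new_tokens=100,      # stop after 100 generated tokens
    do_sample=False,         # greedy decoding: always pick the most likely token
)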

# The Inputs and Outputs of a Trained Transformer LLM

In [3]:
prompt = "Write an email apologizing to Sarah for the tragic gardening mishap. Explain how it happened."

output = generator(prompt)

print(output[0]['generated_text'])

 Mention the steps you're taking to prevent it in the future.

Dear Sarah,

I hope this message finds you well. I am writing to express my sincerest apologies for the unfortunate incident that occurred in your garden. It was a tragic mishap that I deeply regret.

The incident happened when I was attempting to help you with your gardening project. Unfortunately, in my eagerness to assist, I accidentally knock


In [4]:
print(model)

Phi3ForCausalLM(
  (model): Phi3Model(
    (embed_tokens): Embedding(32064, 3072, padding_idx=32000)
    (layers): ModuleList(
      (0-31): 32 x Phi3DecoderLayer(
        (self_attn): Phi3Attention(
          (o_proj): Linear(in_features=3072, out_features=3072, bias=False)
          (qkv_proj): Linear(in_features=3072, out_features=9216, bias=False)
        )
        (mlp): Phi3MLP(
          (gate_up_proj): Linear(in_features=3072, out_features=16384, bias=False)
          (down_proj): Linear(in_features=8192, out_features=3072, bias=False)
          (activation_fn): SiLUActivation()
        )
        (input_layernorm): Phi3RMSNorm((3072,), eps=1e-05)
        (post_attention_layernorm): Phi3RMSNorm((3072,), eps=1e-05)
        (resid_attn_dropout): Dropout(p=0.0, inplace=False)
        (resid_mlp_dropout): Dropout(p=0.0, inplace=False)
      )
    )
    (norm): Phi3RMSNorm((3072,), eps=1e-05)
    (rotary_emb): Phi3RotaryEmbedding()
  )
  (lm_head): Linear(in_features=3072, out_featur
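
The layer shapes printed above are enough to estimate the model's size. The following is a minimal sketch of that arithmetic in plain Python, not something the notebook itself runs. It assumes each `Linear` layer has no bias (as printed), each `Phi3RMSNorm` holds one weight per feature, and `lm_head` is a separate 3072 x 32064 matrix (its printed line is cut off, so the output size is taken from the vocabulary size of `embed_tokens`).

```python
# Sizes read from the printed architecture
HIDDEN = 3072
VOCAB = 32064
NUM_LAYERS = 32


def linear(in_features, out_features):
    """Weight count of a bias-free linear layer."""
    return in_features * out_features


per_layer = (
    linear(HIDDEN, HIDDEN)     # self_attn.o_proj
    + linear(HIDDEN, 9216)     # self_attn.qkv_proj
    + linear(HIDDEN, 16384)    # mlp.gate_up_proj
    + linear(8192, HIDDEN)     # mlp.down_proj
    + 2 * HIDDEN               # input_layernorm + post_attention_layernorm (assumed 1 weight per feature)
)

embedding = VOCAB * HIDDEN         # embed_tokens
final_norm = HIDDEN                # norm
lm_head = linear(HIDDEN, VOCAB)    # assumption: untied, bias-free output projection

total = embedding + NUM_LAYERS * per_layer + final_norm + lm_head
print(f"{total:,} parameters (~{total / 1e9:.2f}B)")
```

Under these assumptions the decoder layers account for most of the weights, while the embedding matrix and `lm_head` each contribute roughly 0.1B.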

# Choosing a single token from the probability distribution (sampling / decoding)

In [16]:
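prompt = "The capital of France is"

# Tokenize the prompt and move the token IDs to the GPU
input_ids = tokenizer(prompt, return_tensors='pt').input_ids
input_ids = input_ids.to("cuda")

# Run only the stack of Transformer blocks (model.model), without the language-modeling head
model_output = model.model(input_ids)

# Project the final hidden states onto the vocabulary: one score (logit) per token per position
lm_head_output = model.lm_head(model_output[0])

print(lm_head_output)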

tensor([[[24.7500, 24.8750, 22.7500,  ..., 19.0000, 19.0000, 19.0000],
         [31.0000, 31.5000, 26.0000,  ..., 25.8750, 25.8750, 25.8750],
         [31.3750, 28.8750, 31.0000,  ..., 26.2500, 26.2500, 26.2500],
         [33.0000, 31.8750, 36.0000,  ..., 27.7500, 27.7500, 27.7500],
         [27.8750, 29.5000, 28.0000,  ..., 20.3750, 20.3750, 20.3750]]],
       device='cuda:0', dtype=torch.bfloat16, grad_fn=<UnsafeViewBackward0>)


In [17]:
# Take the logits at the last position and pick the highest-scoring token ID
token_id = lm_head_output[0, -1].argmax(-1)
tokenizer.decode(token_id)

'Paris'
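
Taking the `argmax` of the logits is greedy decoding, which matches `do_sample=False` in the pipeline above. The sketch below illustrates the alternative: turning logits into a probability distribution with softmax and sampling from it. It is a toy example in plain Python; the four-token vocabulary, the logit values, and the temperature are hypothetical and are not taken from the model.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Convert raw scores into probabilities; a higher temperature flattens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical vocabulary and logits, for illustration only
vocab = ["Paris", "London", "the", "a"]
logits = [36.0, 33.0, 31.875, 27.75]

probs = softmax(logits)

# Greedy decoding: always take the most probable token
greedy_token = vocab[max(range(len(probs)), key=probs.__getitem__)]

# Sampling: draw a token in proportion to its (temperature-scaled) probability
rng = random.Random(0)
sampled_token = rng.choices(vocab, weights=softmax(logits, temperature=1.5), k=1)[0]

print(greedy_token, sampled_token)
```

Greedy decoding is deterministic, while sampling can return a different token on each draw, which is why it is often preferred for more varied text.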

In [21]:
model_output[0].shape

torch.Size([1, 5, 3072])

In [22]:
lm_head_output.shape

torch.Size([1, 5, 32064])
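
The two shapes differ only in the last dimension: the Transformer blocks emit one 3072-dimensional vector per input token, and `lm_head` maps each vector to 32064 scores, one per vocabulary entry. As a rough sketch of that projection, the toy below uses a hidden size of 3 and a vocabulary of 5 in plain Python; the numbers and the `lm_head` helper are made up for illustration.

```python
def lm_head(hidden_states, weight):
    """Map [seq_len][hidden] states to [seq_len][vocab] logits.

    `weight` is laid out as [vocab][hidden], like a bias-free linear layer.
    """
    return [
        [sum(h * w for h, w in zip(state, row)) for row in weight]
        for state in hidden_states
    ]


# Toy values: sequence length 2, hidden size 3, vocabulary size 5
hidden_states = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
weight = [
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
    [1, 1, 0],
    [0, 1, 1],
]

logits = lm_head(hidden_states, weight)
print(len(logits), len(logits[0]))  # sequence length, vocabulary size

# Only the last position is needed to predict the next token
next_token_logits = logits[-1]
next_token_id = max(range(len(next_token_logits)), key=next_token_logits.__getitem__)
```

The logits for earlier positions are computed as well, but only the last row is used when choosing the next token.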

# Speeding up generation by caching keys and values

In [23]:
prompt = "Write a very long email apologizing to Sarah for the tragic gardening mishap. Explain how it happened."

# Tokenize the input prompt
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
input_ids = input_ids.to("cuda")

In [24]:
%%timeit -n 1
# Generate the text with the key/value cache enabled
generation_output = model.generate(
  input_ids=input_ids,
  max_new_tokens=100,
  use_cache=True
)

The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


5.41 s ± 1.87 s per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [25]:
%%timeit -n 1
# Generate the text with the key/value cache disabled
generation_output = model.generate(
  input_ids=input_ids,
  max_new_tokens=100,
  use_cache=False
)

30.1 s ± 676 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
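
The gap between the two timings (5.41 s with the cache, 30.1 s without) comes from redundant work: without a cache, the keys and values of every earlier token are recomputed at each generation step. The sketch below illustrates the idea with a single toy attention head in plain Python. The weights, the token vectors, and the `run` helper are hypothetical, and a real model repeats this across many layers and heads.

```python
import math


def project(vec, weight):
    """Multiply a vector by a weight matrix given as a list of rows."""
    return [sum(v * w for v, w in zip(vec, row)) for row in weight]


def attend(query, keys, values):
    """Scaled dot-product attention for one query over all keys and values."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    peak = max(scores)
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    return [
        sum(w / total * value[i] for w, value in zip(weights, values))
        for i in range(len(values[0]))
    ]


# Toy 2-dimensional projection weights and token vectors (illustrative values)
W_Q = [[1.0, 0.0], [0.0, 1.0]]
W_K = [[0.5, 0.1], [0.2, 0.4]]
W_V = [[0.3, 0.7], [0.6, 0.2]]
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.2]]


def run(tokens, use_cache):
    """Process tokens one step at a time; count how many tokens get key/value projections."""
    kv_projections = 0
    outputs = []
    cached_keys, cached_values = [], []
    for step in range(1, len(tokens) + 1):
        seen = tokens[:step]
        if use_cache:
            # Project only the newest token and reuse everything stored so far
            cached_keys.append(project(seen[-1], W_K))
            cached_values.append(project(seen[-1], W_V))
            kv_projections += 1
            keys, values = cached_keys, cached_values
        else:
            # Recompute keys and values for every token seen so far
            keys = [project(t, W_K) for t in seen]
            values = [project(t, W_V) for t in seen]
            kv_projections += step
        query = project(seen[-1], W_Q)
        outputs.append(attend(query, keys, values))
    return outputs, kv_projections


cached_out, cached_count = run(tokens, use_cache=True)
uncached_out, uncached_count = run(tokens, use_cache=False)
print(cached_out == uncached_out, cached_count, uncached_count)
```

Both runs produce the same attention outputs, but the uncached run's projection count grows quadratically with sequence length while the cached run's grows linearly, at the cost of keeping the stored keys and values in memory.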
