# Looking Inside Large Language Models

## Initial Setup

In [1]:
# Initial imports
import torch

from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline


## Loading the LLM

In [2]:
device: str = "cpu"

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained(
    pretrained_model_name_or_path="microsoft/Phi-3-mini-4k-instruct"
)

model = AutoModelForCausalLM.from_pretrained(
    pretrained_model_name_or_path="microsoft/Phi-3-mini-4k-instruct",
    device_map=device,
    torch_dtype="auto",
    trust_remote_code=False,
)

# Create a pipeline
generator = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    return_full_text=False,
    max_new_tokens=250,
    do_sample=False,
)

Device set to use cpu


## An Overview of Transformer Models

### The Inputs and Outputs of a Trained Transformer LLM

In [3]:
prompt: str = "Write an email apologizing to Sarah for the tragic gardening mishap. Explain how it happened."
output = generator(prompt, max_new_tokens=50)

print(output[0]["generated_text"])

 Mention the steps you're taking to prevent it in the future.

Email:

Subject: Sincere Apologies for the Gardening Mishap

Dear Sarah,

I hope this email finds you well


In [4]:
%%time

prompt: str = "Please, can you give me a brief overview of life and works of Gary Marcus in the Field of AI? Give me a list of bullet points of his main ideas."
output = generator(prompt, max_new_tokens=500)

print(output[0]["generated_text"])



**Solution 1:**

- Gary Marcus is a cognitive scientist and psychologist known for his work in the field of artificial intelligence (AI).
- He has been a professor at New York University and is currently a researcher at Google Brain.
- Marcus has written several books on AI, including "Guitar Zero," "The Baby Code," and "Kluge: The Haphazard Construction of the Human Mind."
- He has been a vocal critic of deep learning and has advocated for a more balanced approach to AI research that includes symbolic reasoning and other cognitive processes.
- Marcus has also been involved in the development of AI technologies, such as the AI startup Geometric Intelligence and the AI startup Robust.AI.

**Main ideas of Gary Marcus:**

- AI research should focus on developing systems that can reason and understand the world in a more human-like way.
- Deep learning has limitations and should be combined with other approaches to create more robust AI systems.
- AI should be developed with ethical cons

### The Components of the Forward Pass

In [5]:
# Print model architecture
model

Phi3ForCausalLM(
  (model): Phi3Model(
    (embed_tokens): Embedding(32064, 3072, padding_idx=32000)
    (embed_dropout): Dropout(p=0.0, inplace=False)
    (layers): ModuleList(
      (0-31): 32 x Phi3DecoderLayer(
        (self_attn): Phi3SdpaAttention(
          (o_proj): Linear(in_features=3072, out_features=3072, bias=False)
          (qkv_proj): Linear(in_features=3072, out_features=9216, bias=False)
          (rotary_emb): Phi3RotaryEmbedding()
        )
        (mlp): Phi3MLP(
          (gate_up_proj): Linear(in_features=3072, out_features=16384, bias=False)
          (down_proj): Linear(in_features=8192, out_features=3072, bias=False)
          (activation_fn): SiLU()
        )
        (input_layernorm): Phi3RMSNorm((3072,), eps=1e-05)
        (resid_attn_dropout): Dropout(p=0.0, inplace=False)
        (resid_mlp_dropout): Dropout(p=0.0, inplace=False)
        (post_attention_layernorm): Phi3RMSNorm((3072,), eps=1e-05)
      )
    )
    (norm): Phi3RMSNorm((3072,), eps=1e-05)
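
The layer shapes printed above are enough for a rough parameter count. The sketch below only does arithmetic on those shapes and does not load the model. It assumes every `Linear` layer is bias-free (as printed), that each `Phi3RMSNorm` holds one weight per hidden unit, and that the `lm_head`, which is cut off in the printout, is an untied, bias-free projection from 3072 to 32064 (the vocabulary size of `embed_tokens`).

```python
# Parameter count estimated from the printed module shapes (no model loading).
VOCAB, HIDDEN, N_LAYERS = 32064, 3072, 32

embed = VOCAB * HIDDEN

per_layer = (
    HIDDEN * HIDDEN      # o_proj
    + HIDDEN * 9216      # qkv_proj (9216 = 3 * 3072: queries, keys, values)
    + HIDDEN * 16384     # gate_up_proj (16384 = 2 * 8192: gate and up halves)
    + 8192 * HIDDEN      # down_proj
    + 2 * HIDDEN         # input_layernorm + post_attention_layernorm weights
)

final_norm = HIDDEN
lm_head = HIDDEN * VOCAB  # assumption: untied with embed_tokens, no bias

total = embed + N_LAYERS * per_layer + final_norm + lm_head
print(f"{total:,} parameters (~{total / 1e9:.2f}B)")
```

Under these assumptions the estimate is about 3.8 billion parameters, with the 32 decoder layers accounting for most of it and the MLP projections dominating each layer.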
 

### Choosing a Single Token from the Probability Distribution (Sampling / Decoding)

In [6]:
prompt = "The capital of Brazil is"

# Tokenize the input prompt
input_ids: torch.Tensor = tokenizer(prompt, return_tensors="pt").input_ids

# Move the input IDs to the target device
input_ids: torch.Tensor = input_ids.to(device)

# Get the output of the model before the lm_head
model_output = model.model(input_ids)

# Get the output of the lm_head
lm_head_output = model.lm_head(model_output[0])

In [7]:
print(f">>> Shape of model output before classification head: {model_output[0].shape}")
print(f">>> Shape of classification head output:              {lm_head_output.shape}")

>>> Shape of model output before classification head: torch.Size([1, 5, 3072])
>>> Shape of classification head output:              torch.Size([1, 5, 32064])


In [8]:
# Test decoding the last token. 
# Greedy decoding: grab the token with the highest logit value
tokenizer.decode(lm_head_output[0, -1].argmax(-1))

'Bras'

In [9]:
# Decode the top-k tokens and get their logit values
topk: int = 10
topk_logits, topk_indices = torch.topk(lm_head_output[0, -1], k=topk)
for logit, index in zip(topk_logits, topk_indices):
    token = tokenizer.decode(index)
    print(f">>> Token: {token:15} | Logit: {logit.item():.4f}")

>>> Token: Bras            | Logit: 43.0000
>>> Token: Brasil          | Logit: 42.5000
>>> Token: Rio             | Logit: 41.5000
>>> Token: _               | Logit: 40.0000
>>> Token: a               | Logit: 39.7500
>>> Token: not             | Logit: 39.5000
>>> Token: São             | Logit: 39.2500
>>> Token: ...             | Logit: 39.0000
>>> Token: located         | Logit: 39.0000
>>> Token: the             | Logit: 39.0000
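
Logits are unnormalized scores, so one more step is needed before a token can be chosen "from the probability distribution". The sketch below applies a softmax to the ten logits printed above and then picks a token two ways: greedily and by sampling. It is a simplification: the probabilities are renormalized over these ten tokens only, not the full 32064-token vocabulary, and the `softmax` helper, the `temperature` parameter, and the fixed seed are illustrative choices rather than what `pipeline` does internally.

```python
import math
import random

# Top-10 tokens and logits copied from the printed output above.
tokens = ["Bras", "Brasil", "Rio", "_", "a", "not", "São", "...", "located", "the"]
logits = [43.0, 42.5, 41.5, 40.0, 39.75, 39.5, 39.25, 39.0, 39.0, 39.0]


def softmax(values, temperature=1.0):
    """Turn logits into probabilities; lower temperature sharpens the distribution."""
    scaled = [v / temperature for v in values]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax(logits)

# Greedy decoding: always pick the most probable token.
greedy_token = tokens[probs.index(max(probs))]

# Sampling: draw a token in proportion to its probability.
rng = random.Random(0)  # fixed seed so the draw is repeatable
sampled_token = rng.choices(tokens, weights=probs, k=1)[0]

for token, p in zip(tokens, probs):
    print(f"{token:10} {p:.3f}")
print("greedy :", greedy_token)
print("sampled:", sampled_token)
```

Greedy decoding corresponds to `do_sample=False` in the pipeline created earlier. Sampling trades that determinism for variety, and lowering `temperature` moves the sampled behavior back toward the greedy choice.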


### Parallel Token Processing and Context Size

### Speeding Up Generation by Caching Keys and Values

In [10]:
prompt = "Write a very long email apologizing to Sarah for the tragic gardening mishap. Explain how it happened."

# Tokenize the input prompt
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
input_ids = input_ids.to(device)

In [11]:
%%timeit -n 1

# Generate the text with the key/value cache enabled.
# Passing an explicit attention mask avoids the pad/eos ambiguity warning;
# for a single unpadded prompt the mask is all ones.
generation_output = model.generate(
    input_ids=input_ids,
    attention_mask=torch.ones_like(input_ids),
    max_new_tokens=20,
    use_cache=True,
)


5.39 s ± 77.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [12]:
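%%timeit -n 1

# Generate the text with the key/value cache disabled.
# The attention mask is all ones because the single prompt has no padding.
generation_output = model.generate(
    input_ids=input_ids,
    attention_mask=torch.ones_like(input_ids),
    max_new_tokens=20,
    use_cache=False,
)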

59.7 s ± 1.01 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
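
A rough way to see where the gap between the two timings comes from is to count how many token positions the model has to process in each mode. The sketch below is a back-of-the-envelope model, not a profile of `model.generate`: `positions_processed` is a hypothetical helper, and `PROMPT_LEN = 25` is a made-up prompt length, since the notebook does not print the real one. It also ignores that attention over cached keys and values still costs time, so the position ratio is not expected to match the measured speedup exactly.

```python
def positions_processed(prompt_len, new_tokens, use_cache):
    """Count token positions run through the model during generation."""
    if use_cache:
        # One pass over the prompt, then one new position per later step.
        return prompt_len + (new_tokens - 1)
    # Without a cache, every step re-runs the whole sequence so far.
    return sum(prompt_len + step for step in range(new_tokens))


PROMPT_LEN = 25   # hypothetical; the notebook does not print the real value
NEW_TOKENS = 20   # matches max_new_tokens in the cells above

with_cache = positions_processed(PROMPT_LEN, NEW_TOKENS, use_cache=True)
without_cache = positions_processed(PROMPT_LEN, NEW_TOKENS, use_cache=False)
measured_speedup = 59.7 / 5.39  # mean timings reported by %%timeit above

print(f"positions with cache:    {with_cache}")
print(f"positions without cache: {without_cache}")
print(f"measured speedup:        {measured_speedup:.1f}x")
```

Without the cache the work grows with the square of the number of generated tokens, so the penalty becomes larger for longer generations than the 20 tokens timed here.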


### Inside the Transformer Block