1. Environment setup

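TinyLlama is a lightweight Llama variant with 1.1B parameters. It is small enough to run on a consumer GPU such as an RTX 3060, and it also runs on datacenter GPUs such as an A100. The official checkpoint used here is TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T.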
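As a rough sanity check on the memory side of that claim, the sketch below estimates the size of the weights alone at several storage widths. The 1.1e9 parameter count comes from the text; the bytes-per-parameter table and the helper name `weight_memory_gb` are illustrative assumptions, and the estimate ignores activations, the KV cache, and any quantization metadata.

```python
# Rough weight-memory estimate; activations and the KV cache are NOT included.
N_PARAMS = 1.1e9  # parameter count quoted in the text

# Bytes per parameter for each storage format (4-bit packs two weights per byte).
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "4-bit": 0.5}


def weight_memory_gb(n_params: float, bytes_per_param: float) -> float:
    """Return the size of the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bytes_per_param / 1e9


for name, width in BYTES_PER_PARAM.items():
    print(f"{name:>5}: {weight_memory_gb(N_PARAMS, width):.2f} GB")
```

In half precision the weights come to about 2.2 GB, which leaves headroom on a 12 GB card for activations and the KV cache.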

Install the dependencies:


In [None]:
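# Create and activate an isolated environment
conda create -n tinyllama python=3.10 -y
conda activate tinyllama

# PyTorch built against CUDA 12.1
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
conda install -c nvidia cuda-runtime=12.1

# Model loading, quantization, and data utilities
pip install transformers==4.43.3 accelerate sentencepiece bitsandbytes
pip install datasets tqdm

# Profiling tools
pip install torchprofile thop

# Notebook widgets (progress bars); the nbextension step applies to the classic Notebook only
pip install ipywidgets
jupyter nbextension enable --py widgetsnbextension

# Experiment tracking
pip install wandb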


Check that torch can see CUDA:

In [5]:
import torch
print(torch.__version__)
print(torch.cuda.is_available())
print(torch.version.cuda)
print(torch.cuda.get_device_name(0))

2.5.1+cu121
True
12.1
NVIDIA GeForce RTX 4070 Laptop GPU


2. Model loading and inference

Run inference directly with the Hugging Face transformers API:


In [5]:
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype="auto")

prompt = "Explain the concept of adaptive computation time in simple terms."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=80,
    do_sample=True,
    temperature=0.7,
    top_p=0.9
)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Some parameters are on the meta device because they were offloaded to the cpu.


Explain the concept of adaptive computation time in simple terms.
Explain the concept of adaptive computation time in simple terms.
Adaptive computation time is the time required to compute a function using an adaptive algorithm.
When a function is given in an adaptive algorithm, it is computed using a number of passes.
Adaptive computation time is a measure of the computational time of an algorithm.
When a function is given


3. Performance analysis (FLOPs and inference time)

The cell below measures the latency of a single inference call and a rough estimate of its compute cost:


In [6]:
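import time

import torch
from thop import profile

prompt = "What is adaptive reasoning in neural networks?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Timing: synchronize before and after so queued GPU work is not missed
torch.cuda.synchronize()
start = time.perf_counter()
_ = model.generate(**inputs, max_new_tokens=64)
torch.cuda.synchronize()
print(f"Inference time: {time.perf_counter() - start:.3f} sec")

# Compute cost of one forward pass over the prompt (not the whole generate call).
# thop counts one multiply-accumulate per weight use, so true FLOPs are roughly 2x this value.
flops, params = profile(model, inputs=(inputs["input_ids"],))
print(f"FLOPs: {flops/1e9:.2f} GFLOPs | Params: {params/1e6:.2f} M")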

Inference time: 23.270 sec
[INFO] Register count_linear() for <class 'torch.nn.modules.linear.Linear'>.
FLOPs: 10.34 GFLOPs | Params: 1034.42 M


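Note: thop results are rough estimates and are best used only to compare configurations against each other. thop counts only layer types it has a counting rule for (the log above registers only torch.nn.Linear), so the parameter total omits the embedding table and normalization weights, which is consistent with 1034.42 M falling below the nominal 1.1B. The value it reports is multiply-accumulate operations for one forward pass over the prompt, not FLOPs for the full generation.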
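The recorded figures can be cross-checked by hand. The sketch below counts the weights of the linear layers only and multiplies by the number of prompt tokens. The architecture constants are assumptions taken from the public TinyLlama-1.1B configuration, and the 10-token prompt length is also an assumption (check `inputs["input_ids"].shape[1]` in the notebook).

```python
# Architecture values are assumptions based on the public TinyLlama-1.1B config.
HIDDEN = 2048          # model width
INTERMEDIATE = 5632    # MLP inner width
N_LAYERS = 22
N_HEADS = 32
N_KV_HEADS = 4         # grouped-query attention: fewer key/value heads
VOCAB = 32000

head_dim = HIDDEN // N_HEADS
kv_dim = N_KV_HEADS * head_dim

# Linear-layer weights per decoder block
attn = 2 * HIDDEN * HIDDEN + 2 * HIDDEN * kv_dim   # q_proj, o_proj + k_proj, v_proj
mlp = 3 * HIDDEN * INTERMEDIATE                    # gate_proj, up_proj, down_proj
lm_head = HIDDEN * VOCAB                           # output projection

linear_params = N_LAYERS * (attn + mlp) + lm_head


def forward_macs(n_tokens: int) -> int:
    """Multiply-accumulates in the linear layers for one forward pass."""
    return linear_params * n_tokens


PROMPT_TOKENS = 10  # assumed prompt length, including the BOS token
macs = forward_macs(PROMPT_TOKENS)
print(f"Linear params: {linear_params / 1e6:.2f} M")   # 1034.42 M
print(f"MACs:  {macs / 1e9:.2f} G")                    # 10.34 G
print(f"FLOPs: {2 * macs / 1e9:.2f} G (2 per MAC)")    # 20.69 G
```

Under these assumptions the hand count reproduces both recorded numbers, which supports reading the thop output as linear-layer MACs rather than total FLOPs.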

4. Quantization and acceleration

TinyLlama can be loaded in 8-bit or 4-bit precision to reduce GPU memory use:

In [None]:
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                  # enable 4-bit quantization
    bnb_4bit_use_double_quant=True,     # double quantization (saves more memory)
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,     # pass the quantization config
    device_map="auto"
)


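To give a feel for what 4-bit storage means, here is a deliberately simplified sketch of blockwise absmax quantization to signed 4-bit integers. It is not the scheme bitsandbytes implements (its 4-bit formats and double quantization are more involved); the function names and the sample weights are hypothetical.

```python
def quantize_block(values, bits=4):
    """Map a block of floats to signed integer codes sharing one scale factor."""
    qmax = 2 ** (bits - 1) - 1                       # 7 for 4-bit
    scale = max(abs(v) for v in values) / qmax or 1.0  # fall back to 1.0 for an all-zero block
    codes = [round(v / scale) for v in values]       # each code lies in [-qmax, qmax]
    return codes, scale


def dequantize_block(codes, scale):
    """Recover approximate floats from the integer codes."""
    return [c * scale for c in codes]


weights = [0.12, -0.5, 0.33, 0.07]                   # hypothetical weight block
codes, scale = quantize_block(weights)
restored = dequantize_block(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(codes, f"scale={scale:.4f}", f"max_error={max_error:.4f}")
```

Each weight is stored as a small integer plus one shared scale per block, so the rounding error is bounded by half the scale. That trade of precision for memory is the general idea behind low-bit loading.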

In [28]:
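# Reuses `inputs` from an earlier cell (for the output below, the prompt was "What is the capital of France?")
outputs = model.generate(
    **inputs,
    max_new_tokens=50,
    do_sample=True,          # enable sampling explicitly
    temperature=0.7,         # randomness: lower is more predictable, higher is more varied
    top_p=0.9,               # nucleus sampling: keep the smallest token set whose cumulative probability reaches 0.9
    repetition_penalty=1.2   # penalize tokens that have already appeared
)

# skip_special_tokens=True gives cleaner decoded text
print(tokenizer.decode(outputs[0], skip_special_tokens=True))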

What is the capital of France?
The capital city of France is Paris. The official language in France is French, which is also the official language in most other countries of Europe. It is not surprising that a large number of people speak English fluently or are fluent in this
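The interaction between `temperature` and `top_p` is easier to see on a toy distribution. The sketch below is a plain-Python illustration of temperature scaling followed by nucleus filtering. It is not the transformers implementation, and the helper names and logits are made up for the example.

```python
import math


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)                                  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the smallest set of most likely tokens whose mass reaches top_p, renormalized."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    return {i: probs[i] / mass for i in kept}


toy_logits = [2.0, 1.0, 0.5, -1.0]                   # hypothetical 4-token vocabulary
for t in (1.0, 0.7):
    nucleus = top_p_filter(softmax(toy_logits, temperature=t), top_p=0.9)
    print(f"temperature={t}: candidate token ids {sorted(nucleus)}")
```

Lowering the temperature sharpens the distribution, so fewer tokens are needed to reach the 0.9 mass and the candidate set shrinks.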


5. Experiment logging and visualization suggestions

Use Weights & Biases (wandb) to log timing metrics, for example a per-layer time breakdown:

In [29]:
import wandb
wandb.init(project="tinyllama-replication")

Then add the following inside the inference loop:

In [30]:
# `elapsed` and `n_tokens` must be set by the surrounding inference loop before this call
wandb.log({"inference_time": elapsed, "tokens_generated": n_tokens})


NameError: name 'elapsed' is not defined
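The NameError arises because nothing has defined `elapsed` or `n_tokens` yet. One way to produce them is a small timing wrapper, sketched below. The helper `timed_generation`, its parameters, and the stand-in `fake_generate` are hypothetical; the logger is passed in so the sketch runs without a wandb account.

```python
import time


def timed_generation(generate_fn, prompt_len, log_fn=print):
    """Run generate_fn once, time it, and pass a metrics dict to log_fn.

    generate_fn must return the full list of token ids (prompt + new tokens).
    """
    start = time.perf_counter()
    output_ids = generate_fn()
    elapsed = time.perf_counter() - start
    n_tokens = len(output_ids) - prompt_len          # newly generated tokens only
    metrics = {
        "inference_time": elapsed,
        "tokens_generated": n_tokens,
        "tokens_per_second": n_tokens / elapsed if elapsed > 0 else 0.0,
    }
    log_fn(metrics)
    return metrics


def fake_generate():
    """Stand-in for a model call: a 10-token prompt followed by 64 new tokens."""
    time.sleep(0.01)
    return list(range(74))


timed_generation(fake_generate, prompt_len=10)
```

In the notebook, `generate_fn` could be `lambda: model.generate(**inputs, max_new_tokens=64)[0].tolist()`, with `prompt_len=inputs["input_ids"].shape[1]` and `log_fn=wandb.log`. Converting to a list copies the result to the CPU, which also ensures the GPU work has finished before the timer stops.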

A later stage (confidence control) can compute the softmax entropy of the output distribution:

In [31]:
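import torch
import torch.nn.functional as F

with torch.no_grad():                                   # inference only, no gradients needed
    logits = model(**inputs).logits[:, -1, :].float()   # next-token logits, shape (batch, vocab)
log_probs = F.log_softmax(logits, dim=-1)               # numerically stable log-probabilities
entropy = -(log_probs.exp() * log_probs).sum(dim=-1)    # entropy in nats, one value per sequence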
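To interpret the entropy value, it helps to know its range: 0 for a one-hot distribution and log(V) for a uniform distribution over V tokens. The plain-Python sketch below shows both extremes on toy logits. Dividing by log(V) to get a score in [0, 1] is one possible normalization, offered here as an assumption rather than as part of the original plan.

```python
import math


def softmax_entropy(logits):
    """Entropy (in nats) of softmax(logits), computed via log-sum-exp for stability."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    log_probs = [x - log_z for x in logits]
    return -sum(math.exp(lp) * lp for lp in log_probs)


def normalized_entropy(logits):
    """Entropy divided by its maximum log(V): 0 means certain, 1 means uniform."""
    return softmax_entropy(logits) / math.log(len(logits))


uniform = [0.0, 0.0, 0.0, 0.0]     # the model has no preference
peaked = [10.0, 0.0, 0.0, 0.0]     # the model strongly prefers token 0

print(f"uniform: {normalized_entropy(uniform):.3f}")
print(f"peaked:  {normalized_entropy(peaked):.3f}")
```

A confidence-control rule could then compare the normalized entropy against a threshold, with low values treated as confident predictions.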
