<a href="https://colab.research.google.com/github/Alirezaprogramerrd99/MiniGPT-FromScratch-and-LoRA/blob/main/Mini_GPT2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip -q install torch --extra-index-url https://download.pytorch.org/whl/cu121
!pip -q install transformers datasets pyyaml

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.utils.checkpoint as ckp
from torch import amp
from torch.utils.data import Dataset, DataLoader
from transformers import AutoTokenizer
import math
from torch.nn.utils import clip_grad_norm_
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR
import os, requests, re, datasets


In [None]:
!nvidia-smi

/bin/bash: line 1: nvidia-smi: command not found


In [None]:
# from google.colab import files, drive
# import os, io

# # Optionally mount Drive to save checkpoints/output
# drive.mount('/content/drive', force_remount=True)

# os.makedirs("data", exist_ok=True)
# print("Upload a plain-text file (book).")
# up = files.upload()  # choose your .txt

# name = next(iter(up))
# with open("data/raw.txt", "wb") as f:
#     f.write(up[name])
# print("Saved to data/raw.txt, size:", os.path.getsize("data/raw.txt")/1e6, "MB")

In [None]:
import requests, os

os.makedirs("data", exist_ok=True)

url = "https://www.gutenberg.org/cache/epub/11/pg11.txt"  # Alice
r = requests.get(url, timeout=30)
r.raise_for_status()  # fail loudly instead of saving an HTTP error page as the book


with open("data/raw.txt", "w", encoding="utf-8") as f:
    f.write(r.text)

print("Downloaded Alice in Wonderland:", len(r.text), "chars")

Downloaded Alice in Wonderland: 167674 chars


# Light Cleaning + Chapter Separators (helps the model)
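
As a rough illustration of what this step does, the sketch below applies the same whitespace rules and a simplified chapter pattern to a tiny made-up sample. The `sample` string and the `clean` helper are illustrative assumptions, not part of the notebook's pipeline.

```python
import re

SEP = "\n<|endofchapter|>\n"

# Hypothetical miniature "book" with messy spacing and two chapter headings.
sample = (
    "Intro text.\n\n\n\n"
    "CHAPTER I.\nDown the Rabbit-Hole\n\n"
    "Alice was   tired.\n\n"
    "CHAPTER II.\nThe Pool of Tears\n"
)

def clean(txt: str) -> str:
    txt = re.sub(r"\n{3,}", "\n\n", txt)   # collapse runs of blank lines
    txt = re.sub(r"[ \t]+", " ", txt)      # collapse runs of spaces/tabs
    # Put a separator in front of every line that looks like a chapter heading.
    return re.sub(r"\n\s*CHAPTER\s+\w+\b.*\n",
                  lambda m: SEP + m.group(0), txt, flags=re.I)

cleaned = clean(sample)
print(cleaned)
print("separators inserted:", cleaned.count("<|endofchapter|>"))
```

The pattern only checks that a line begins with the word "CHAPTER", so it can also fire on table-of-contents entries, which is why the separator step is described as heuristic.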


In [None]:
import re, os

with open("data/raw.txt","r",encoding="utf-8", errors="ignore") as f:
    txt = f.read()

# Normalize line endings, then collapse excessive whitespace
txt = re.sub(r'\r\n?', '\n', txt)
txt = re.sub(r'\n{3,}', '\n\n', txt)
txt = re.sub(r'[ \t]+', ' ', txt)

# Insert a separator token between likely chapter boundaries (very heuristic)
SEP = "\n<|endofchapter|>\n"
txt = re.sub(r'\n\s*CHAPTER\s+[\w\dIVXLC]+\b.*\n', lambda m: SEP + m.group(0), txt, flags=re.I)

# Guarantee start & end separators
if not txt.lstrip().startswith("<|endofchapter|>"):
    txt = SEP + txt
if not txt.rstrip().endswith("<|endofchapter|>"):
    txt = txt + SEP

with open("data/raw.cleaned.txt","w",encoding="utf-8") as f:
    f.write(txt)

with open("data/raw.txt", "r", encoding="utf-8", errors="ignore") as f:
    print("Chars (raw):", len(f.read()))
print("Chars (clean):", len(txt))


Chars (raw): 163916
Chars (clean): 163470


# Model: “more sophisticated” but Colab-friendly

Pick one config. `d_model` must be divisible by `n_heads`.

**Medium (T4 friendly):**
- `d_model=384`, `n_heads=6`, `n_layers=8`, `mlp_ratio=4`, `dropout=0.1`, `max_seq_len=256`

**Bigger (A100 ok; may OOM on T4):**
- `d_model=512`, `n_heads=8`, `n_layers=12`, `mlp_ratio=4`, `dropout=0.1`, `max_seq_len=512`
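
To get a feel for how large each config is, the following sketch estimates the parameter count by hand. It assumes the architecture used later in this notebook (bias-free attention projections, a GELU MLP with biases, learned positional embeddings, an output head tied to the token embedding) and the GPT-2 vocabulary of 50,257 tokens. The function name `gpt_param_count` is a hypothetical helper.

```python
def gpt_param_count(vocab_size, d_model, n_heads, n_layers, mlp_ratio, max_seq_len):
    """Rough parameter count for a GPT-style decoder with a tied output head."""
    assert d_model % n_heads == 0, "d_model must be divisible by n_heads"
    emb = vocab_size * d_model + max_seq_len * d_model      # token + position tables
    attn = 3 * d_model * d_model + d_model * d_model        # qkv + output proj, no bias
    hidden = mlp_ratio * d_model
    mlp = (d_model * hidden + hidden) + (hidden * d_model + d_model)  # two biased linears
    norms = 2 * (2 * d_model)                               # two LayerNorms per block
    final_ln = 2 * d_model
    return emb + n_layers * (attn + mlp + norms) + final_ln

medium = gpt_param_count(50257, 384, 6, 8, 4, 256)
bigger = gpt_param_count(50257, 512, 8, 12, 4, 512)
print(f"medium: {medium / 1e6:.2f}M params, head_dim={384 // 6}")
print(f"bigger: {bigger / 1e6:.2f}M params, head_dim={512 // 8}")
```

In the medium config, more than half of the parameters sit in the token embedding table, so shrinking `n_layers` alone saves less memory than one might expect.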

In [None]:
D_MODEL   = 384
N_HEADS   = 6
N_LAYERS  = 8
MLP_RATIO = 4
DROPOUT   = 0.1
BLOCK_SIZE= 256  # keep modest for T4

# Tokenizer + Dataset (single book)

We’ll use the GPT-2 tokenizer and split the single book into train/val with a 95/5 split by character count.  
We’ll create a **sliding-window dataset** whose items are `(x, y)` pairs of `block_size` tokens, where `y` is `x` shifted by one token (the next-token targets).
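
The windowing can be sketched in plain Python on a toy sequence. The `windows` helper and the token values below are made up for illustration. The dataset class in this notebook does the same slicing on a tensor.

```python
def windows(ids, block_size):
    """All (x, y) pairs where y is x shifted one position to the right."""
    n = max(0, len(ids) - block_size)  # the last window still needs one extra token for y
    return [(ids[i:i + block_size], ids[i + 1:i + 1 + block_size]) for i in range(n)]

toy_ids = [10, 11, 12, 13, 14, 15]   # six hypothetical token ids
pairs = windows(toy_ids, block_size=4)
for x, y in pairs:
    print("x =", x, "| y =", y)
```

Consecutive windows overlap by `block_size - 1` tokens, so one pass over the dataset shows the model each token many times in different positions.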


In [None]:


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# --- Tokenizer
tok = AutoTokenizer.from_pretrained("gpt2")
if tok.pad_token is None:
    tok.add_special_tokens({"pad_token": tok.eos_token})  # avoid OOV for pad

with open("data/raw.cleaned.txt","r",encoding="utf-8") as f:
    full_txt = f.read()

pivot = int(0.95 * len(full_txt))
train_txt, val_txt = full_txt[:pivot], full_txt[pivot:]


In [None]:
def encode(s): return tok.encode(s, add_special_tokens=False)

train_ids = torch.tensor(encode(train_txt), dtype=torch.long)
val_ids   = torch.tensor(encode(val_txt),   dtype=torch.long)

Token indices sequence length is longer than the specified maximum sequence length for this model (46668 > 1024). Running this sequence through the model will result in indexing errors


In [None]:
class TextTokenDataset(Dataset):
    def __init__(self, ids: torch.Tensor, block_size: int):
        assert ids.dim() == 1
        self.ids = ids
        self.block = block_size
    def __len__(self):
        return max(0, len(self.ids) - self.block)  # the last window needs one extra token for y
    def __getitem__(self, idx):
        x = self.ids[idx: idx+self.block]
        y = self.ids[idx+1: idx+1+self.block]
        return x, y

In [None]:
train_ds = TextTokenDataset(train_ids, BLOCK_SIZE)
val_ds   = TextTokenDataset(val_ids,   BLOCK_SIZE)
train_dl = DataLoader(train_ds, batch_size=8, shuffle=True,  drop_last=True)
val_dl   = DataLoader(val_ds,   batch_size=8, shuffle=False, drop_last=False)

print("Vocab:", tok.vocab_size, "| train tokens:", len(train_ids), "| val tokens:", len(val_ids))


# Model (with SDPA attention + checkpointing gated to training)

Key Colab-stability tips:
- Use **PyTorch SDPA** (`F.scaled_dot_product_attention`) with `is_causal=True` → stable + fast.
- Gate **checkpointing** to training only.
- Use **`torch.amp`** device-type API for AMP (no deprecation warnings).
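
For intuition, here is a minimal pure-Python sketch of what causal scaled dot-product attention computes for a single head. It is a readable reference under simplifying assumptions (no batching, no dropout, plain lists), not a substitute for the fused PyTorch kernel. The name `causal_attention` is hypothetical.

```python
import math

def causal_attention(q, k, v):
    """q, k, v: lists of T vectors. Position t may only attend to positions <= t."""
    T, d = len(q), len(q[0])
    outputs, weights = [], []
    for t in range(T):
        # Scaled dot products against the visible prefix only (the causal mask).
        scores = [sum(a * b for a, b in zip(q[t], k[s])) / math.sqrt(d) for s in range(t + 1)]
        m = max(scores)                                  # subtract max for numerical stability
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        w = [e / z for e in exps] + [0.0] * (T - t - 1)  # future positions get zero weight
        weights.append(w)
        outputs.append([sum(w[s] * v[s][j] for s in range(T)) for j in range(len(v[0]))])
    return outputs, weights

q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
out, attn = causal_attention(q, k, v)
for row in attn:
    print([round(x, 3) for x in row])
```

The printed weight matrix is lower-triangular: the first token can only look at itself, which is the behavior `is_causal=True` requests.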

In [None]:
class CausalSelfAttention(nn.Module):
    def __init__(self, d_model: int, n_heads: int, dropout: float):
        super().__init__()
        assert d_model % n_heads == 0
        self.nh = n_heads
        self.dk = d_model // n_heads
        self.qkv = nn.Linear(d_model, 3*d_model, bias=False)
        self.proj= nn.Linear(d_model, d_model, bias=False)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):  # x: (B,T,C)
        B,T,C = x.size()
        qkv = self.qkv(x).view(B, T, 3, self.nh, self.dk).permute(2,0,3,1,4)
        q,k,v = qkv[0], qkv[1], qkv[2]  # (B,H,T,d_k)

        y = F.scaled_dot_product_attention(
            q, k, v,
            attn_mask=None,
            dropout_p=self.dropout.p if self.training else 0.0,
            is_causal=True
        )  # (B,H,T,d_k)

        y = y.transpose(1,2).contiguous().view(B,T,C)
        return self.proj(y)

In [None]:
class TransformerBlock(nn.Module):
    def __init__(self, d_model, n_heads, mlp_ratio, dropout):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn= CausalSelfAttention(d_model, n_heads, dropout)
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, mlp_ratio*d_model),
            nn.GELU(),
            nn.Linear(mlp_ratio*d_model, d_model),
            nn.Dropout(dropout),
        )
    def forward(self, x):
        x = x + self.attn(self.ln1(x))
        x = x + self.mlp(self.ln2(x))
        return x


In [None]:
class GPT(nn.Module):
    def __init__(self, vocab_size, d_model, n_layers, n_heads, mlp_ratio, max_seq_len, dropout=0.1, weight_tying=True, pad_id=None):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_seq_len, d_model)
        self.drop = nn.Dropout(dropout)
        self.blocks = nn.ModuleList([TransformerBlock(d_model, n_heads, mlp_ratio, dropout) for _ in range(n_layers)])
        self.ln_f = nn.LayerNorm(d_model)
        self.head = nn.Linear(d_model, vocab_size, bias=False)
        if weight_tying:
            self.head.weight = self.tok_emb.weight
        self.max_seq_len = max_seq_len
        self.pad_id = pad_id

    def forward(self, idx, targets=None):
        B, T = idx.shape
        assert T <= self.max_seq_len, f"T={T} > max_seq_len={self.max_seq_len}"
        pos = torch.arange(T, device=idx.device).unsqueeze(0)
        x = self.tok_emb(idx) + self.pos_emb(pos)
        x = self.drop(x)

        use_ckpt = True
        for blk in self.blocks:
            if use_ckpt and self.training and torch.is_grad_enabled():
                x = ckp.checkpoint(blk, x, use_reentrant=False)  # non-reentrant mode is the recommended variant
            else:
                x = blk(x)

        x = self.ln_f(x)
        logits = self.head(x)
        loss = None
        if targets is not None:
            if self.pad_id is not None:
                loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1), ignore_index=self.pad_id)
            else:
                loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
        return logits, loss

In [None]:
model = GPT(
    vocab_size=tok.vocab_size,
    d_model=D_MODEL, n_layers=N_LAYERS, n_heads=N_HEADS,
    mlp_ratio=MLP_RATIO, max_seq_len=BLOCK_SIZE, dropout=DROPOUT,
    weight_tying=True, pad_id=tok.pad_token_id
).to(device)

sum_params = sum(p.numel() for p in model.parameters())/1e6
print(f"Model params: {sum_params:.2f}M")

Model params: 33.58M


# Training Loop (Colab-safe: AMP, warmup, clipping)
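
The learning-rate shape used in this section (linear warmup followed by cosine decay to a floor) and the effect of gradient accumulation can be checked with a few lines of plain Python. The defaults mirror the hyperparameters set in the cells of this section. The helper name `lr_multiplier` is an illustrative assumption.

```python
import math

def lr_multiplier(step, warmup_steps=2_000, max_steps=10_000, min_lr_ratio=0.1):
    """Factor applied to the base LR: linear warmup, then cosine decay to min_lr_ratio."""
    if step < warmup_steps:
        return (step + 1) / warmup_steps
    t = (step - warmup_steps) / max(1, max_steps - warmup_steps)
    return min_lr_ratio + 0.5 * (1 - min_lr_ratio) * (1 + math.cos(math.pi * t))

for s in (0, 1_000, 2_000, 6_000, 10_000):
    print(f"step {s:>6}: lr = {1.5e-4 * lr_multiplier(s):.2e}")

# Gradient accumulation: 8 micro-batches of 8 sequences, each 256 tokens long.
sequences_per_update = 8 * 8
tokens_per_update = sequences_per_update * 256
print("tokens per optimizer update:", tokens_per_update)
```

The multiplier is a function of whatever counter the scheduler is stepped with, so `warmup_steps` and `max_steps` should be expressed in the same unit as that counter (optimizer updates or micro-batches).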



In [None]:
lr = 1.5e-4
weight_decay = 0.1
max_steps = 10_000
log_interval = 200
grad_accum_steps = 8  # use this to enlarge effective batch without OOM

opt = AdamW(model.parameters(), lr=lr, weight_decay=weight_decay)
scaler = amp.GradScaler('cuda', enabled=(device.type=="cuda"))

warmup_steps = 2_000
min_lr_ratio = 0.1

In [None]:
def lr_lambda(step):
    if step < warmup_steps:
        return (step + 1) / warmup_steps
    t = (step - warmup_steps) / max(1, (max_steps - warmup_steps))
    # cosine from 1.0 → min_lr_ratio
    return min_lr_ratio + 0.5*(1 - min_lr_ratio)*(1 + math.cos(math.pi*t))


@torch.no_grad()
def evaluate(dl):
    model.eval()
    total, n = 0.0, 0
    with amp.autocast('cuda', dtype=torch.float16, enabled=(device.type=="cuda")):
        for xb, yb in dl:
            xb, yb = xb.to(device), yb.to(device)
            _, loss = model(xb, yb)
            total += float(loss.item()); n += 1
    model.train()
    return total / max(n, 1)

In [None]:
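# The scheduler is stepped once per optimizer update, while `step`, `warmup_steps`
# and `max_steps` count micro-batches, so convert updates to micro-batches here.
sched = LambdaLR(opt, lambda s: lr_lambda(s * grad_accum_steps))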

step = 0
model.train()
autocast_ctx = amp.autocast('cuda', dtype=torch.float16, enabled=(device.type=="cuda"))

for epoch in range(9999):
    for xb, yb in train_dl:
        xb, yb = xb.to(device), yb.to(device)
        with autocast_ctx:
            _, loss = model(xb, yb)
            loss = loss / grad_accum_steps

        scaler.scale(loss).backward()

        if (step + 1) % grad_accum_steps == 0:
            scaler.unscale_(opt)
            clip_grad_norm_(model.parameters(), 1.0)
            scaler.step(opt)
            scaler.update()
            opt.zero_grad(set_to_none=True)
            sched.step()

        if step % log_interval == 0:
            val_loss = evaluate(val_dl)
            print(f"step {step} | train_loss {(loss.item()*grad_accum_steps):.4f} | val_loss {val_loss:.4f} | lr {sched.get_last_lr()[0]:.2e}")

        step += 1
        if step >= max_steps:
            break
    if step >= max_steps:
        break

# Generation (with repetition penalty to reduce loops)
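
A common formulation of the repetition penalty is sign-aware: positive logits of already-seen tokens are divided by the penalty and negative ones are multiplied by it, so a seen token always becomes less likely. The sketch below shows this on a plain list of logits. The helper name and the example values are made up for illustration.

```python
def apply_repetition_penalty(logits, seen_ids, penalty=1.2):
    """Return a copy of logits where every already-generated token is made less likely."""
    out = list(logits)
    for i in set(seen_ids):
        # Dividing a negative logit by penalty > 1 would move it toward zero
        # (more likely), so negative logits are multiplied instead.
        out[i] = out[i] / penalty if out[i] > 0 else out[i] * penalty
    return out

logits = [2.4, -1.2, 0.5]
penalized = apply_repetition_penalty(logits, seen_ids=[0, 1, 1])
print(penalized)
```

Token 2 was never generated, so its logit is left untouched, while both seen tokens move down regardless of sign.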

In [None]:
import torch

def generate(model, tokenizer, prompt: str, max_new_tokens=120, top_k=50, top_p=0.9, temperature=0.9, repetition_penalty=1.2, device=None):
    model.eval()
    device = device or next(model.parameters()).device
    ids = tokenizer.encode(prompt, add_special_tokens=False)
    x = torch.tensor(ids, dtype=torch.long, device=device)[None, :]

    with torch.no_grad(), amp.autocast('cuda', dtype=torch.float16, enabled=(device.type=="cuda")):
        for _ in range(max_new_tokens):
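            # feed only the last max_seq_len tokens, but keep the full sequence in x
            x_cond = x[:, -model.max_seq_len:]
            logits, _ = model(x_cond)
            logits = logits[:, -1, :].float()

            # repetition penalty (sign-aware): dividing a negative logit would make it MORE likely
            if repetition_penalty and x.size(1) > 1:
                uniq = torch.unique(x)
                seen = logits[..., uniq]
                logits[..., uniq] = torch.where(seen > 0, seen / repetition_penalty, seen * repetition_penalty)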

            logits = logits / max(temperature, 1e-6)
            probs = torch.softmax(logits, dim=-1)

            # top-k
            if top_k > 0:
                v, ix = torch.topk(probs, top_k)
                mask = torch.ones_like(probs, dtype=torch.bool)
                mask.scatter_(1, ix, False)
                probs = probs.masked_fill(mask, 0)
                probs = probs / probs.sum(dim=-1, keepdim=True)

            # nucleus
            if top_p < 1.0:
                sorted_probs, sorted_idx = torch.sort(probs, descending=True)
                cumsum = torch.cumsum(sorted_probs, dim=-1)
                mask = cumsum > top_p
                mask[..., 0] = False
                sorted_probs[mask] = 0
                sorted_probs = sorted_probs / sorted_probs.sum(dim=-1, keepdim=True)
                next_id = torch.multinomial(sorted_probs, num_samples=1)
                next_token = sorted_idx.gather(-1, next_id)
            else:
                next_token = torch.multinomial(probs, num_samples=1)

            x = torch.cat([x, next_token], dim=1)

    return tokenizer.decode(x[0].tolist())

print(generate(model, tok, "CHAPTER I. ", max_new_tokens=200, top_k=50, top_p=0.9, temperature=0.8))


# Pre-training and Fine-tuning (on CPU)

In [None]:
!pip -q install transformers datasets peft accelerate evaluate

device = torch.device("cpu")  # explicit CPU
torch.set_num_threads(2)      # keep Colab snappy; bump if you like
print("Torch:", torch.__version__, "| device:", device)


Torch: 2.8.0+cu126 | device: cpu


In [None]:
dataset = datasets.DatasetDict({
    "train": datasets.Dataset.from_dict({"text":[train_txt]}),
    "validation": datasets.Dataset.from_dict({"text":[val_txt]}),
})
dataset

DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: 1
    })
    validation: Dataset({
        features: ['text'],
        num_rows: 1
    })
})

In [None]:
model_name = "distilgpt2"  # small & fast on CPU

tok = AutoTokenizer.from_pretrained(model_name)
if tok.pad_token is None:
    tok.add_special_tokens({"pad_token": tok.eos_token})

BLOCK_SIZE = 256  # keep modest on CPU

def tokenize(examples):
    # Add attention_mask to the output
    return tok(examples["text"], add_special_tokens=False, return_attention_mask=True)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

# group texts into contiguous blocks of BLOCK_SIZE
def group_texts(examples):
    # Process both input_ids and attention_mask
    ids = sum(examples["input_ids"], [])
    masks = sum(examples["attention_mask"], [])
    total_len = (len(ids) // BLOCK_SIZE) * BLOCK_SIZE
    ids = ids[:total_len]
    masks = masks[:total_len]
    chunks = [ids[i:i+BLOCK_SIZE] for i in range(0, total_len, BLOCK_SIZE)]
    attention_masks = [masks[i:i+BLOCK_SIZE] for i in range(0, total_len, BLOCK_SIZE)]
    labels = [list(c) for c in chunks]  # HF causal-LM models shift labels internally, so labels == input_ids
    return {"input_ids": chunks, "attention_mask": attention_masks, "labels": labels}

lm_ds = tokenized.map(group_texts, batched=True)
lm_ds


DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 182
    })
    validation: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 7
    })
})
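
The row count above follows directly from the block grouping. As a sanity check, the sketch below regroups a stand-in sequence of 46,668 ids (the training token count reported in the tokenizer warning earlier) into blocks of 256. The `group_into_blocks` helper and the fake ids are assumptions for illustration.

```python
def group_into_blocks(ids, block_size):
    """Split a flat token list into full blocks, dropping the ragged tail."""
    usable = (len(ids) // block_size) * block_size
    return [ids[i:i + block_size] for i in range(0, usable, block_size)]

fake_ids = list(range(46_668))           # stand-in for the tokenized training text
blocks = group_into_blocks(fake_ids, 256)
dropped = len(fake_ids) - len(blocks) * 256
print("rows:", len(blocks), "| tokens dropped from the tail:", dropped)
```

Unlike the sliding-window dataset used for the from-scratch model, these blocks do not overlap, so each token appears in exactly one training row.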

In [None]:
from transformers import AutoModelForCausalLM, Trainer, TrainingArguments, DataCollatorForLanguageModeling
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained(model_name)
model.resize_token_embeddings(len(tok))  # if pad token added

# LoRA config: train only the low-rank adapters (~0.4M params) instead of the full ~82M-param model
lora_cfg = LoraConfig(
    r=8,                # rank
    lora_alpha=16,      # scaling
    lora_dropout=0.05,  # regularization
    target_modules=["c_attn", "c_proj"],  # GPT-2 module names; c_proj matches both the attention and MLP output projections
    bias="none",
    task_type="CAUSAL_LM"
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()

data_collator = DataCollatorForLanguageModeling(tok, mlm=False)

args = TrainingArguments(
    output_dir="lora_out",
    per_device_train_batch_size=1,      # tiny batch on CPU
    per_device_eval_batch_size=1,
    gradient_accumulation_steps=32,     # simulate larger batch
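    num_train_epochs=4,                 # or use max_steps if preferred
    eval_steps=200,                     # only takes effect when an eval strategy is enabled
    logging_steps=100,                  # this run has only ~24 optimizer steps, so nothing gets logged
    save_steps=500,
    learning_rate=1e-4,                 # conservative for CPU fine-tune
    weight_decay=0.1,
    warmup_steps=300,                   # longer than the whole run; lower it so the LR can reach its peak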
    lr_scheduler_type="cosine",
    no_cuda=True,
    bf16=False, fp16=False,             # CPU only
    report_to="none"
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=lm_ds["train"],
    eval_dataset=lm_ds["validation"],
    data_collator=data_collator,
)

trainer.train()



trainable params: 405,504 || all params: 82,318,080 || trainable%: 0.4926


`loss_type=None` was set in the config but it is unrecognized. Using the default loss: `ForCausalLMLoss`.


Step,Training Loss


TrainOutput(global_step=24, training_loss=3.583665211995443, metrics={'train_runtime': 1681.8123, 'train_samples_per_second': 0.433, 'train_steps_per_second': 0.014, 'total_flos': 48009446424576.0, 'train_loss': 3.583665211995443, 'epoch': 4.0})
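
Both the trainable-parameter count and the number of optimizer steps reported above can be reproduced by hand. The sketch assumes distilgpt2's published shape (6 layers, hidden size 768, MLP width 3072) and that LoRA adapters were attached to `c_attn`, the attention `c_proj`, and the MLP `c_proj` in each layer. It also assumes that a final partial accumulation group counts as one step, which matches the reported `global_step=24`.

```python
import math

def lora_params(in_features, out_features, r):
    """A LoRA adapter adds A (r x in) and B (out x r) next to a frozen weight."""
    return r * (in_features + out_features)

d, r, n_layers = 768, 8, 6
per_layer = (
    lora_params(d, 3 * d, r)      # c_attn: fused q, k, v projection
    + lora_params(d, d, r)        # attention c_proj
    + lora_params(4 * d, d, r)    # MLP c_proj
)
trainable = n_layers * per_layer
print("trainable LoRA params:", trainable)

rows, batch_size, grad_accum, epochs = 182, 1, 32, 4
steps_per_epoch = math.ceil(rows / (batch_size * grad_accum))
total_steps = epochs * steps_per_epoch
print("optimizer steps:", total_steps)
```

With so few optimizer steps, any warmup or logging interval larger than a couple of dozen steps never comes into play during this run.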

In [None]:
from transformers import pipeline, AutoModelForCausalLM
from peft import PeftModel

# Load the base model
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float32)

# Load the PEFT adapters and merge them
model = PeftModel.from_pretrained(model, "/content/lora_out/checkpoint-24", local_files_only=True)
model = model.merge_and_unload()

model.eval()

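# Dynamic quantization for CPU inference speed. GPT-2 blocks use transformers' Conv1D
# rather than nn.Linear, so in practice only the nn.Linear lm_head is quantized here: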
model_q = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)



`torch_dtype` is deprecated! Use `dtype` instead!
For migrations of users: 
1. Eager mode quantization (torch.ao.quantization.quantize, torch.ao.quantization.quantize_dynamic), please migrate to use torchao eager mode quantize_ API instead 
2. FX graph mode quantization (torch.ao.quantization.quantize_fx.prepare_fx,torch.ao.quantization.quantize_fx.convert_fx, please migrate to use torchao pt2e quantization API instead (prepare_pt2e, convert_pt2e) 
3. pt2e quantization has been migrated to torchao (https://github.com/pytorch/ao/tree/main/torchao/quantization/pt2e) 
see https://github.com/pytorch/ao/issues/2259 for more details
  model_q = torch.quantization.quantize_dynamic(


In [None]:
pipe = pipeline("text-generation", model=model_q, tokenizer=tok, device=-1)
print(pipe("CHAPTER I. ", max_new_tokens=120, do_sample=True, top_k=50, top_p=0.9, temperature=0.8, repetition_penalty=1.2)[0]["generated_text"])

Device set to use cpu


CHAPTER I. ills
(A) The U.S. is the only two "pagan" state that not all human beings in our culture (a total of to be exact), or at least, it and this was as follows: [I] don’t think- a term such as names have an impact on where who we are because you can come across for being just after his name; what about more serious problems? People with no common cause will still also work out without them by which time many other things change based upon history.[4][5]" It's right—in part due


In [None]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

BASE_MODEL = "distilgpt2"   # or "gpt2" if that’s what you fine-tuned from
LORA_DIR   = "/content/lora_out/checkpoint-24"     # your PEFT save dir

# 1) tokenizer
tok = AutoTokenizer.from_pretrained(BASE_MODEL)
if tok.pad_token is None:
    tok.add_special_tokens({"pad_token": tok.eos_token})

# 2) load base
base = AutoModelForCausalLM.from_pretrained(BASE_MODEL, torch_dtype=torch.float32)
base.resize_token_embeddings(len(tok))

# 3) attach LoRA adapters
model = PeftModel.from_pretrained(base, LORA_DIR)

# 4) MERGE adapters into the base weights for plain HF inference
model = model.merge_and_unload()   # <- important!
model.eval()

# 5) (Optional) dynamic quantization for faster CPU inference
model_q = torch.quantization.quantize_dynamic(model, {torch.nn.Linear}, dtype=torch.qint8)


For migrations of users: 
1. Eager mode quantization (torch.ao.quantization.quantize, torch.ao.quantization.quantize_dynamic), please migrate to use torchao eager mode quantize_ API instead 
2. FX graph mode quantization (torch.ao.quantization.quantize_fx.prepare_fx,torch.ao.quantization.quantize_fx.convert_fx, please migrate to use torchao pt2e quantization API instead (prepare_pt2e, convert_pt2e) 
3. pt2e quantization has been migrated to torchao (https://github.com/pytorch/ao/tree/main/torchao/quantization/pt2e) 
see https://github.com/pytorch/ao/issues/2259 for more details
  model_q = torch.quantization.quantize_dynamic(model, {torch.nn.Linear}, dtype=torch.qint8)
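
Conceptually, merging folds the low-rank update into the frozen weight, W' = W + (alpha / r) * B A, so the merged layer gives the same output as running the base layer and the adapter side by side. The toy example below checks this with tiny hand-written matrices. The matrix values and helper names are made up, and `alpha=2, r=1` is chosen to give the same scale of 2 as `lora_alpha=16, r=8`.

```python
def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def merge_lora(W, A, B, alpha, r):
    """Fold the adapter into the base weight: W + (alpha / r) * (B @ A)."""
    scale = alpha / r
    BA = matmul(B, A)
    return [[w + scale * d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(W, BA)]

W = [[1, 0, 2], [0, 1, -1]]   # frozen base weight, shape (out=2, in=3)
A = [[1, 2, 3]]               # LoRA down-projection, shape (r=1, in=3)
B = [[1], [-1]]               # LoRA up-projection, shape (out=2, r=1)
alpha, r = 2, 1
x = [[1], [1], [1]]           # one input column vector

# Unmerged path: base output plus scaled adapter output.
base_out = matmul(W, x)
adapter_out = matmul(B, matmul(A, x))
unmerged = [[b[0] + (alpha / r) * a[0]] for b, a in zip(base_out, adapter_out)]

# Merged path: a single matrix product with the folded weight.
merged = matmul(merge_lora(W, A, B, alpha, r), x)
print(unmerged, merged)
```

After merging, inference needs no extra adapter matmuls, which is why the merged model can be passed to plain Hugging Face generation utilities.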


In [None]:
def generate_text(model, tok, prompt,
                  max_new_tokens=120,
                  temperature=0.8, top_k=40, top_p=0.9,
                  repetition_penalty=1.2, no_repeat_ngram_size=3,
                  num_return_sequences=1):
    enc = tok(prompt, return_tensors="pt")
    out = model.generate(
        input_ids=enc.input_ids,
        attention_mask=enc.attention_mask,  # explicit mask: pad and eos share the same id
        max_new_tokens=max_new_tokens,
        do_sample=True,
        temperature=temperature,
        top_k=top_k,
        top_p=top_p,
        repetition_penalty=repetition_penalty,
        no_repeat_ngram_size=no_repeat_ngram_size,
        pad_token_id=tok.eos_token_id,
        num_return_sequences=num_return_sequences,
    )
    return [tok.decode(o, skip_special_tokens=True) for o in out]

samples = generate_text(model_q, tok, "CHAPTER I. ", max_new_tokens=120)
print(samples[0])


The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


CHAPTER I. __________
I, in that most and all things on the way to "1st!" which at first we just so happened upon a great-great general being among my friends of many times it seems from an ordinary point of view no more than as if such people were only out there making their day's work through down streets or with them by long distances not getting very for themselves (of any character our time here was over like this one; both will you be next: what have made other men some difficult) even without power - before two would come they full strength against man up front?


In [None]:
from transformers import pipeline
pipe = pipeline("text-generation", model=model_q, tokenizer=tok, device=-1)
print(pipe("Chapter 1. I was reading my ", max_new_tokens=120, do_sample=True,
           temperature=0.8, top_k=40, top_p=0.9,
           repetition_penalty=1.2, no_repeat_ngram_size=3)[0]["generated_text"])

Device set to use cpu


Chapter 1. I was reading my vernacular, and the other two of them just said they (snowed) "wish to help you in an hour or more while on a call."
