# Domain-Adaptive Pretraining Task with a Pretrained Model
We will use the `space mission corpus` dataset (a custom dataset) to fine-tune the `Qwen/Qwen2.5-0.5B-Instruct` model.

In [1]:
MODEL_NAME = "Qwen/Qwen2.5-0.5B-Instruct"

In [2]:
import os
os.makedirs("data", exist_ok=True)
os.makedirs("models", exist_ok=True)
os.environ["TOKENIZERS_PARALLELISM"] = "false"

## Reference
This notebook is mainly based on Chapter 6 of the book <i><u>The Practical Guide to Large Language Models: Hands-On AI Applications with Hugging Face Transformers</u></i>, and it uses the data from that chapter.

## Setup for GPU

In [3]:
import torch

DEVICE = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
print(f"Using Device: {DEVICE}")

Using Device: cuda


## Fine Tuning the model

### Prepare dataset for training
This time the data is loaded from several files in a folder rather than from a single file, so I pass the list of file paths to `load_dataset` through the `data_files` argument.

In [4]:
from datasets import load_dataset

In [5]:
dataset = load_dataset(
    "text",
    data_files = ['data/space_missions_corpus/intro.txt', 'data/space_missions_corpus/hubble_space_telescope_corpus.txt'],
    split = "train"
)
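
The two paths above are listed by hand. If the folder grows, the list could be built from the folder contents instead. The following is a minimal sketch, assuming every `.txt` file under `data/space_missions_corpus` belongs to the corpus; the helper name `list_corpus_files` is hypothetical and not part of `datasets`.

```python
from pathlib import Path


def list_corpus_files(folder, pattern="*.txt"):
    """Return the sorted paths of all files in `folder` matching `pattern`."""
    # Sorting keeps the file order (and therefore the row order) reproducible.
    return sorted(str(path) for path in Path(folder).glob(pattern))


# Hypothetical usage: collect every text file in the corpus folder.
corpus_files = list_corpus_files("data/space_missions_corpus")
print(corpus_files)
```

The resulting list could then be passed as `data_files=corpus_files`. Sorting is alphabetical, so `intro.txt` would no longer come first; order the list explicitly if the file order matters.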

In [6]:
len(dataset)

523

In [7]:
dataset.column_names

['text']

In [8]:
dataset[0]

{'text': '--------------------------------------------------------------------------'}

In [9]:
dataset[:30]

{'text': ['--------------------------------------------------------------------------',
  'SECTION 0 — WHY SPACE MISSIONS MATTER',
  '--------------------------------------------------------------------------',
  'A space mission is a focused, time-bound effort to answer questions about',
  'the Universe, the Earth, the Moon, or other bodies, using hardware that',
  'must survive launch, spaceflight, and operations far from Earth. Missions',
  'range from small cubesats that fly for months to flagship observatories',
  'that work for decades. The common pattern is: define a science or',
  'exploration goal; design a spacecraft; launch; cruise and navigate; operate;',
  'downlink data; publish results; retire the hardware or extend the mission.',
  '',
  'This corpus explains the lifecycle of missions and gives plain-language',
  'overviews of well-known projects such as the James Webb Space Telescope,',
  'the Artemis lunar program, Mars rovers, sample-return craft, and more. The',
  '

In [10]:
dataset[490:523]

{'text': ['  • Bright‑object protection must be respected; exceeding limits risks damage.',
  '  • Detector persistence and saturation can affect near‑infrared data; use',
  '    appropriate readout patterns and exposure sequences.',
  '  • Archive users should read the instrument handbooks and data handbooks to',
  '    understand modes, caveats, and recommended reduction steps.',
  '',
  '-------------------------------------------------------------------------------',
  'SECTION 9 — GLOSSARY (PLAIN DEFINITIONS)',
  '-------------------------------------------------------------------------------',
  'ACS — Advanced Camera for Surveys, an imaging instrument.',
  'Aperture — the effective opening of a telescope’s light‑collecting system.',
  'Calibration — the process that converts raw counts into physically meaningful units.',
  'COS — Cosmic Origins Spectrograph, a UV spectrograph.',
  'Drizzling — image combination method that improves sampling and handles distortion.',
  'FGS — Fin

### Prepare Tokenizer

In [11]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, use_fast=True)

In [12]:
tokenizer.pad_token is None

False

A helper function that tokenizes each row and appends the EOS token id.

In [13]:
def tokenize_func(batch):
    ids = []
    eos_token_id = tokenizer.eos_token_id
    for row in batch["text"]:
        # Manually append the EOS token id to each row
        tokens = tokenizer(row, add_special_tokens=False)["input_ids"] + [eos_token_id]
        ids.append(tokens)
    return {"input_ids": ids}
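
Because the `text` loader yields one row per line, the function above appends an EOS id after every line, including blank ones. The sketch below illustrates that behaviour with a stand-in tokenizer. `toy_encode` and `EOS_ID` are made up for the illustration and are not the Qwen tokenizer or its real ids.

```python
EOS_ID = 0  # hypothetical id standing in for tokenizer.eos_token_id


def toy_encode(text):
    """Stand-in tokenizer: map each whitespace-separated word to a fake id."""
    return [len(word) for word in text.split()]


def tokenize_rows(rows):
    """Mirror tokenize_func: encode each row, then append the EOS id."""
    return [toy_encode(row) + [EOS_ID] for row in rows]


rows = ["SECTION 0", "", "A space mission"]
print(tokenize_rows(rows))  # [[7, 1, 0], [0], [1, 5, 7, 0]]
```

The blank row becomes a sequence that holds only the EOS id, so a corpus with many blank lines contributes many isolated EOS tokens to the training stream.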

In [14]:
tokenized_dataset = dataset.map(tokenize_func, batched=True, remove_columns=dataset.column_names)

In [15]:
tokenized_dataset.column_names

['input_ids']

### Prepare the windowed corpus
Concatenate the tokenized rows and pack them into fixed-size windows of `SEQ_LEN` tokens.

In [16]:
from itertools import chain

In [17]:
SEQ_LEN = 1024

def group_to_window_func(ds, seq_len=SEQ_LEN):
    # Concatenate all tokens and split into blocks of SEQ_LEN
    concatenated = list(chain(*ds["input_ids"]))
    total_length = (len(concatenated) // seq_len) * seq_len
    concatenated = concatenated[:total_length]
    input_blocks = [concatenated[i:i+seq_len] for i in range(0, total_length, seq_len)]
    return {
        "input_ids": input_blocks,
        "labels":[block[:] for block in input_blocks]
    }

In [18]:
feed_dataset = tokenized_dataset.map(
    group_to_window_func,
    batched=True,
    remove_columns=tokenized_dataset.column_names,
    desc = "Packing tokens into fixed-size blocks"
)

In [19]:
len(feed_dataset)

5
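
To make the packing step concrete, here is a small pure-Python sketch of the same idea with a window of 4 tokens instead of 1024. The function name `pack_into_blocks` and the toy token ids are illustrative assumptions, not part of the notebook.

```python
from itertools import chain


def pack_into_blocks(token_lists, seq_len):
    """Concatenate token lists and cut them into full blocks of seq_len tokens."""
    flat = list(chain.from_iterable(token_lists))
    usable = (len(flat) // seq_len) * seq_len  # tokens past this point are dropped
    return [flat[i:i + seq_len] for i in range(0, usable, seq_len)]


toy_rows = [[7, 1, 0], [0], [1, 5, 7, 0], [3, 0]]  # 10 tokens in total
blocks = pack_into_blocks(toy_rows, seq_len=4)
print(blocks)       # [[7, 1, 0, 0], [1, 5, 7, 0]]
print(len(blocks))  # 2, the trailing [3, 0] does not fill a block
```

The same arithmetic applies to the real run. All 523 rows fit in one `map` batch (the default batch size is 1000), so 5 blocks of 1024 tokens imply that the corpus tokenized to at least 5,120 and fewer than 6,144 tokens, with the remainder discarded.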

### Load the pretrained model

In [20]:
from transformers import AutoModelForCausalLM

In [21]:
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, device_map="auto")

In [22]:
model.config.use_cache = False
model.resize_token_embeddings(len(tokenizer))

Embedding(151665, 896)

In [23]:
model.to(DEVICE)

Qwen2ForCausalLM(
  (model): Qwen2Model(
    (embed_tokens): Embedding(151665, 896)
    (layers): ModuleList(
      (0-23): 24 x Qwen2DecoderLayer(
        (self_attn): Qwen2SdpaAttention(
          (q_proj): Linear(in_features=896, out_features=896, bias=True)
          (k_proj): Linear(in_features=896, out_features=128, bias=True)
          (v_proj): Linear(in_features=896, out_features=128, bias=True)
          (o_proj): Linear(in_features=896, out_features=896, bias=False)
          (rotary_emb): Qwen2RotaryEmbedding()
        )
        (mlp): Qwen2MLP(
          (gate_proj): Linear(in_features=896, out_features=4864, bias=False)
          (up_proj): Linear(in_features=896, out_features=4864, bias=False)
          (down_proj): Linear(in_features=4864, out_features=896, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): Qwen2RMSNorm((896,), eps=1e-06)
        (post_attention_layernorm): Qwen2RMSNorm((896,), eps=1e-06)
      )
    )
    (norm): Qwen2RMSNorm((

### Prepare trainer for fine-tuning process

In [24]:
from transformers import DataCollatorForLanguageModeling, TrainingArguments, Trainer

In [25]:
data_collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)
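
With `mlm=False`, the collator prepares labels for causal language modeling. As far as I understand the `transformers` behaviour, it copies `input_ids` into `labels` and replaces padding positions with `-100`, overwriting any `labels` column that was already in the dataset. The one-token shift between inputs and targets happens inside the model's loss computation. The following pure-Python sketch only imitates the label-masking step; `causal_lm_labels` and the pad id `99` are hypothetical.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default


def causal_lm_labels(input_ids, pad_token_id):
    """Copy input_ids into labels, masking padding so it does not add to the loss."""
    return [
        [IGNORE_INDEX if token == pad_token_id else token for token in row]
        for row in input_ids
    ]


batch = [[11, 12, 13, 99], [21, 22, 99, 99]]  # 99 is a made-up pad id
print(causal_lm_labels(batch, pad_token_id=99))
# [[11, 12, 13, -100], [21, 22, -100, -100]]
```

In this notebook every block is exactly `SEQ_LEN` tokens long, so no padding is added and the masking would only affect positions where the pad token id itself appears in the text.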

In [26]:
TUNED_MODEL_DIR = "models/Qwn2.5-0.5B-DART"

train_args = TrainingArguments(
    output_dir=TUNED_MODEL_DIR,
    num_train_epochs = 30,
    per_device_train_batch_size = 1,
    gradient_accumulation_steps = 1,
    dataloader_num_workers = 1,
    learning_rate = 2e-5,
    weight_decay = 0.01,
    warmup_ratio = 0.03,
    logging_steps = 10,
    save_strategy = "no",
    eval_strategy = "no",
    use_cpu = False,
    fp16 = True,
    bf16 = False,
)

In [27]:
trainer = Trainer(
    model = model,
    args = train_args,
    train_dataset = feed_dataset,
    processing_class = tokenizer,
    data_collator = data_collator,
)

### Fine Tune with Trainer

In [28]:
trainer.train()

| Step | Training Loss |
|------|---------------|
| 10   | 2.9181        |
| 20   | 1.343         |
| 30   | 0.3039        |
| 40   | 0.0678        |
| 50   | 0.0431        |
| 60   | 0.0268        |
| 70   | 0.0112        |
| 80   | 0.0042        |
| 90   | 0.0042        |
| 100  | 0.002         |


TrainOutput(global_step=150, training_loss=0.3151880735438317, metrics={'train_runtime': 127.3962, 'train_samples_per_second': 1.177, 'train_steps_per_second': 1.177, 'total_flos': 329838900019200.0, 'train_loss': 0.3151880735438317, 'epoch': 30.0})
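
The reported `global_step=150` can be cross-checked against the training arguments. The sketch below is a rough estimate, assuming a single device and that warmup steps are rounded up from `warmup_ratio`, which is my understanding of the `Trainer` behaviour; the function `total_optimizer_steps` is hypothetical.

```python
import math


def total_optimizer_steps(num_samples, epochs, batch_size, grad_accum=1, num_devices=1):
    """Estimate optimizer steps for a run with a fixed number of epochs."""
    batches_per_epoch = math.ceil(num_samples / (batch_size * num_devices))
    steps_per_epoch = max(batches_per_epoch // grad_accum, 1)
    return steps_per_epoch * epochs


# 5 packed blocks, 30 epochs, batch size 1, no gradient accumulation
steps = total_optimizer_steps(num_samples=5, epochs=30, batch_size=1)
warmup_steps = math.ceil(steps * 0.03)  # warmup_ratio = 0.03
print(steps, warmup_steps)  # 150 5
```

With `logging_steps = 10`, 150 steps correspond to 15 logged loss values, so the loss log shown above, which stops at step 100, appears to be cut off in the captured output.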

### Save the fine-tuned model

In [29]:
trainer.save_model(TUNED_MODEL_DIR)

In [30]:
tokenizer.save_pretrained('models/Qwn2.5-0.5B-tokenizer-dart')

('models/Qwn2.5-0.5B-tokenizer-dart/tokenizer_config.json',
 'models/Qwn2.5-0.5B-tokenizer-dart/special_tokens_map.json',
 'models/Qwn2.5-0.5B-tokenizer-dart/vocab.json',
 'models/Qwn2.5-0.5B-tokenizer-dart/merges.txt',
 'models/Qwn2.5-0.5B-tokenizer-dart/added_tokens.json',
 'models/Qwn2.5-0.5B-tokenizer-dart/tokenizer.json')

## Compare Fine-Tuned to Pretrained

In [3]:
import torch

DEVICE = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
print(f"Using Device: {DEVICE}")

Using Device: cuda


In [4]:
from transformers import AutoTokenizer, AutoModelForCausalLM

In [5]:
MODEL_NAME = "Qwen/Qwen2.5-0.5B-Instruct"
TUNED_MODEL_DIR = "models/Qwn2.5-0.5B-DART"

In [6]:
tokenizer = AutoTokenizer.from_pretrained('models/Qwn2.5-0.5B-tokenizer-dart')

### Load the pretrained model without fine-tuning

In [7]:
original_model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
original_model.resize_token_embeddings(len(tokenizer))

Embedding(151665, 896)

### Load the fine-tuned model

In [8]:
trained_model = AutoModelForCausalLM.from_pretrained(TUNED_MODEL_DIR)
trained_model.resize_token_embeddings(len(tokenizer))

Embedding(151665, 896)

A function that generates an answer from a model for a given prompt and prints it.

In [17]:
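def generate(tokenizer, model, prompt, max_new_tokens=120):
    # Wrap the prompt in a simple question/answer template
    prompt = f"Question: {prompt}\nAnswer:"
    # Move the inputs to the model's device so CPU and GPU models both work
    inputs = tokenizer(prompt, return_tensors = "pt").to(model.device)
    eos_token_id = tokenizer.eos_token_id
    pad_token_id = tokenizer.pad_token_id
    # Inference only, so gradients are not needed
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens = max_new_tokens,
            eos_token_id = eos_token_id,
            pad_token_id = pad_token_id
        )
    print("="*20)
    # The decoded text contains the prompt followed by the generated answer
    print(tokenizer.decode(outputs[0], skip_special_tokens = True))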

In [18]:
def compare(tokenizer, original_model, trained_model, prompts):
    for prompt in prompts:
        print("\n--- Before fine tuning ---")
        generate(tokenizer, original_model, prompt)
        print("\n--- After fine tuning ---")
        generate(tokenizer, trained_model, prompt)
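
`generate` prints the prompt together with the completion. For a side-by-side comparison it may be handy to keep only the text after `Answer:`. The helper below is a hypothetical sketch that works on the decoded string; it is not part of the notebook or of `transformers`.

```python
def extract_answer(decoded_text, marker="Answer:"):
    """Return the text after the first `marker`, or the whole text if it is absent."""
    _, found, answer = decoded_text.partition(marker)
    return answer.strip() if found else decoded_text.strip()


decoded = "Question: What is Hubble?\nAnswer: An Earth-orbiting observatory."
print(extract_answer(decoded))  # An Earth-orbiting observatory.
```

Only the first `Answer:` is used as the split point, so a follow-up such as the "Is the question answered correctly? Answer: Yes." that the base model sometimes generates stays part of the extracted text.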

In [19]:
PROMPTS = [
    "What is the Hubble Space Telescope and why is operating above the atmosphere beneficial?",
    "How did astronaut servicing extend Hubble’s lifespan and capabilities?",
    "What do Curiosity, Perseverance, and Ingenuity do on Mars, and how do the rovers navigate safely?"
]

In [20]:
compare(tokenizer, original_model, trained_model, PROMPTS)


--- Before fine tuning ---
Question: What is the Hubble Space Telescope and why is operating above the atmosphere beneficial?
Answer: The Hubble Space Telescope, operated by NASA's Goddard Space Flight Center in Greenbelt, Maryland, has been a vital tool for astronomers. It provides images of distant galaxies that have never before been seen from Earth.
Is the question answered correctly? Answer: Yes. The answer is correct because it states that the Hubble Space Telescope is an instrument that was created to observe objects at great distances, like those found in the universe beyond our solar system. It also mentions that it operates above the atmosphere, which allows it to capture images of these distant objects without being affected by the Earth's atmosphere

--- After fine tuning ---
Question: What is the Hubble Space Telescope and why is operating above the atmosphere beneficial?
Answer: The Hubble Space Telescope (HST) is an Earth‑orbiting observatory that views the Universe abo