# 🧠 Continual Pretraining of Mistral-7B on Telecom Domain Text

This notebook documents the **continual pretraining** of the [Mistral-7B-Instruct-v0.3](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3) language model on a **domain-specific telecom dataset** (a plain-text file) derived from **cleaned ITU (International Telecommunication Union) standards documents**.

The goal is to adapt Mistral-7B to the **telecommunication domain**, allowing the model to better understand specialized terms, formal structure, and context found in ITU regulatory and technical documentation. This step is a **precursor to instruction fine-tuning**, where the model will be trained to follow instructions in a task-specific manner using prompts and responses.

---

### 📌 Project Highlights

- **Dataset**: Cleaned telecom-focused text corpus compiled from ITU standards documents.
- **Base Model**: [`mistralai/Mistral-7B-Instruct-v0.3`](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3)
- **Approach**: Continual pretraining with LoRA adapters on a 4-bit quantized model, using Hugging Face's `transformers` and `peft` libraries.
- **Purpose**: Domain adaptation before applying instruction fine-tuning.
- **Expected Outcome**: Improved performance on telecom-related tasks post-finetuning.

---


## 📦 Install Required Libraries

This section installs the necessary Python packages for continual pretraining of the Mistral-7B model:

- `peft`: For Parameter-Efficient Fine-Tuning techniques like LoRA.
- `accelerate`: From Hugging Face; manages multi-GPU or TPU training.
- `bitsandbytes`: Enables 8-bit and 4-bit quantization for memory-efficient training.
- `transformers`: Core library for working with pretrained language models.
- `datasets`: For loading and managing large-scale datasets.
- `GPUtil`: Helps in querying GPU availability and memory.
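
After installation, it can help to confirm which versions were actually picked up, since Kaggle images ship with some of these packages preinstalled. The following is a minimal sketch using only the standard library; the helper name `installed_versions` is hypothetical and not part of the original notebook.

```python
from importlib import metadata

# Distribution names as published on PyPI (pip matches them case-insensitively)
REQUIRED = ["peft", "accelerate", "bitsandbytes", "transformers", "datasets", "GPUtil"]

def installed_versions(packages):
    """Return {package: version string, or None if the package is not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions

# Print one line per required package
for name, version in installed_versions(REQUIRED).items():
    print(f"{name:15s} {version or 'NOT INSTALLED'}")
```

Recording these versions alongside the run makes it easier to reproduce the training environment later.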


In [1]:
%%capture
# Install the required libraries silently (%%capture suppresses the output)

!pip install peft            # For parameter-efficient fine-tuning (e.g., LoRA, QLoRA)
!pip install accelerate      # Accelerate, for managing multi-GPU or TPU training
!pip install bitsandbytes    # Library for quantization (8-bit, 4-bit) to reduce memory usage
!pip install transformers    # Hugging Face Transformers library for model loading and training
!pip install datasets        # Hugging Face Datasets library for handling datasets
!pip install GPUtil          # Utility to check and manage GPU resources


## ⚙️ GPU Availability and Device Configuration

This section checks for the availability of a GPU and configures the CUDA device environment:

- Uses `GPUtil` to display current GPU utilization.
- Checks whether a CUDA-compatible GPU is available.
- If available, prints a confirmation; otherwise, sets the device to CPU.
- Sets environment variables to control CUDA device ordering and visibility.

In [3]:
import os

# Set CUDA environment variables before any CUDA call so that they take effect
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"   # Order devices by PCI bus ID
os.environ["CUDA_VISIBLE_DEVICES"] = "0"         # Make only GPU 0 visible to CUDA

import torch
import GPUtil

# Display current GPU usage and memory stats
GPUtil.showUtilization()

# Select the device: GPU if a CUDA-compatible one is available, otherwise CPU
if torch.cuda.is_available():
    device = torch.device("cuda")
    print("GPU is available")
else:
    device = torch.device("cpu")  # Fallback to CPU
    print("GPU is not available, using CPU instead")

| ID | GPU | MEM |
------------------
|  0 |  0% |  0% |
GPU is available


## 🧱 Import Libraries and Model Utilities

This section imports the core libraries and utilities needed for:

- Loading the base model (`AutoModelForCausalLM`) and tokenizer.
- Applying quantization configuration (`BitsAndBytesConfig`) for efficient memory usage.
- Accessing datasets using `load_dataset`.
- Logging into the Hugging Face Hub (`notebook_login`) to download/upload models.
- Preparing the model for efficient fine-tuning using **LoRA** via `peft`.


In [4]:
# Hugging Face Transformers core module
import transformers

# For loading the tokenizer and the causal language model (e.g., Mistral-7B)
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

# For logging in to Hugging Face Hub (e.g., for access to private models or pushing results)
from huggingface_hub import notebook_login

# Load datasets from the Hugging Face Datasets library
from datasets import load_dataset

# Tools for parameter-efficient fine-tuning (PEFT) using LoRA
from peft import prepare_model_for_kbit_training, LoraConfig, get_peft_model




## 🔐 Authenticate with Hugging Face and Weights & Biases

This section securely retrieves and configures authentication credentials for:

- **Hugging Face Hub**: To access or push models (via `hf_api_key`).
- **Weights & Biases (wandb)**: For experiment tracking and visualizing training metrics.

The keys are securely retrieved using Kaggle’s `UserSecretsClient` (useful in notebooks running on Kaggle). After authentication:
- The Hugging Face client is logged in for model interaction.
- Weights & Biases is initialized for tracking the continual pretraining run.
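
The secret lookup above only works on Kaggle. As a sketch of how the same notebook could run elsewhere, the helper below tries Kaggle's secret store first and falls back to an environment variable. The function name `get_secret` is hypothetical; `HF_TOKEN` and `WANDB_API_KEY` are the environment variables that `huggingface_hub` and `wandb` conventionally read.

```python
import os

def get_secret(kaggle_label, env_var):
    """Try Kaggle's secret store first, then fall back to an environment variable."""
    try:
        from kaggle_secrets import UserSecretsClient  # Only importable on Kaggle
        return UserSecretsClient().get_secret(kaggle_label)
    except Exception:
        # Not running on Kaggle, or the secret is not attached to this notebook
        value = os.environ.get(env_var)
        if value is None:
            raise RuntimeError(
                f"Set {env_var} or attach the Kaggle secret {kaggle_label!r}"
            )
        return value
```

With this helper, the two lookups would read `get_secret("HF token", "HF_TOKEN")` and `get_secret("wandb", "WANDB_API_KEY")`.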


In [5]:
# Securely access stored secrets using Kaggle's UserSecretsClient
from kaggle_secrets import UserSecretsClient
user_secrets = UserSecretsClient()

# Retrieve Hugging Face token and Weights & Biases token
hf_api_key = user_secrets.get_secret("HF token")
wandb_api_key = user_secrets.get_secret("wandb")

In [6]:
# Log in to Hugging Face Hub using the retrieved token
from huggingface_hub import login
login(token=hf_api_key)
print("Successfully logged into Hugging Face Hub!")

Successfully logged into Hugging Face Hub!


In [7]:
# Import Weights & Biases for experiment tracking
import wandb

# Authenticate to wandb using the retrieved API key
wandb.login(key=wandb_api_key)

# Initialize a new run in the specified wandb project
run = wandb.init(
    project="Cintinually Pre-trained Mistral-7B-Instruct-v0.3",  # Typo in 'Continually' left unchanged
    job_type="training",
    anonymous="allow"
)

[34m[1mwandb[0m: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.
[34m[1mwandb[0m: No netrc file found, creating one.
[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc
[34m[1mwandb[0m: Currently logged in as: [33magaba_embedded[0m ([33magaba_embedded4[0m) to [32mhttps://api.wandb.ai[0m. Use [1m`wandb login --relogin`[0m to force relogin


## 🧠 Load Mistral-7B-Instruct with 4-Bit Quantization

In this step, we load the base model for continual pretraining:

- **Model**: [`mistralai/Mistral-7B-Instruct-v0.3`](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3)
  - A **7 billion parameter**, decoder-only transformer model.
  - Trained by [Mistral AI](https://mistral.ai) for **instruction following** (chat-style use cases).
  - It builds on the original Mistral-7B foundation model and is fine-tuned for better alignment and task following.
  - Suitable for use cases like QA, summarization, and dialog — even more so after domain adaptation.

We apply **4-bit quantization** using the `BitsAndBytesConfig` to reduce GPU memory usage:

- `load_in_4bit=True`: Enables loading model weights in 4-bit precision.
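- `bnb_4bit_use_double_quant=True`: Also quantizes the quantization constants (double quantization), which saves additional memory.
- `bnb_4bit_quant_type="nf4"`: Uses the **4-bit NormalFloat (NF4)** data type, which is designed for normally distributed weights such as those of pretrained LLMs.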
- `bnb_4bit_compute_dtype=torch.bfloat16`: Computation is done in `bfloat16` for performance + stability.

Quantization is critical when working with large models like Mistral-7B on limited hardware (e.g., single GPU).
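
To make the memory argument concrete, here is a rough back-of-the-envelope estimate of weight storage. It is a sketch under stated assumptions: the parameter count is rounded to 7 billion, and the block sizes (64 weights per block, 256 constants per second-level block) are the values described in the QLoRA paper, assumed here rather than read from this notebook's configuration. Activations, LoRA adapters, and optimizer state are not included.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

NUM_PARAMS = 7e9           # Rounded; the exact parameter count is slightly higher
BLOCK_SIZE = 64            # Assumed number of weights sharing one quantization constant
SECOND_LEVEL_BLOCK = 256   # Assumed number of constants per second-level block

# Overhead of the quantization constants, in bits per weight
overhead_plain = 32 / BLOCK_SIZE                                            # One fp32 constant per block
overhead_double = 8 / BLOCK_SIZE + 32 / (BLOCK_SIZE * SECOND_LEVEL_BLOCK)   # 8-bit constants plus fp32 second level

fp16_gb = weight_memory_gb(NUM_PARAMS, 16)
nf4_gb = weight_memory_gb(NUM_PARAMS, 4 + overhead_plain)
nf4_dq_gb = weight_memory_gb(NUM_PARAMS, 4 + overhead_double)

print(f"16-bit weights:            {fp16_gb:.2f} GB")
print(f"NF4 weights:               {nf4_gb:.2f} GB")
print(f"NF4 + double quantization: {nf4_dq_gb:.2f} GB")
```

Under these assumptions the weights shrink from about 14 GB in 16-bit precision to roughly 3.6 GB, which is what makes a single-GPU run feasible.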


In [8]:
# Define the base model to be used for continual pretraining
base_model_id = "mistralai/Mistral-7B-Instruct-v0.3"

# Configure 4-bit quantization using bitsandbytes
bnb_config = BitsAndBytesConfig(
    load_in_4bit = True,                        # Load model weights in 4-bit precision to save memory
    bnb_4bit_use_double_quant = True,           # Also quantize the quantization constants (saves extra memory)
    bnb_4bit_quant_type = "nf4",                # Use the NF4 (4-bit NormalFloat) quantization format
    bnb_4bit_compute_dtype = torch.bfloat16     # Use bfloat16 for computations (faster, stable)
)

# Load the pre-trained Mistral-7B-Instruct model with the specified quantization config
model = AutoModelForCausalLM.from_pretrained(
    base_model_id,
    quantization_config = bnb_config           # Apply quantization settings when loading the model
)



## 📝 Prepare Telecom Text Dataset for Pretraining

This section loads the telecom corpus from a `.txt` file containing cleaned ITU standard texts and prepares it for continual pretraining.

### Key Steps:
- **Blank line removal**: Skips empty lines to avoid noise.
- **Chunking**: Groups every `N` lines (e.g., 10) into a single training sample.
  - This simulates longer text contexts while staying within memory limits.
  - You can increase `chunk_size` to make inputs more informative (if your system supports it).
- **Hugging Face Dataset**: Wraps the processed text into a format compatible with the Transformers training API.

This prepares a high-quality telecom-specific corpus for next-token prediction tasks.
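
The blank-line removal and chunking steps can be illustrated on a small in-memory list. This is a minimal sketch: the helper name `chunk_lines` and the placeholder strings are hypothetical, not taken from the ITU corpus.

```python
def chunk_lines(lines, chunk_size):
    """Drop blank lines, strip whitespace, and join every `chunk_size` lines into one sample."""
    cleaned = [line.strip() for line in lines if line.strip()]
    return [
        " ".join(cleaned[i:i + chunk_size])
        for i in range(0, len(cleaned), chunk_size)
    ]

# Placeholder lines standing in for the real text file
sample_lines = ["Clause 1", "", "  Clause 2  ", "Clause 3", "\n", "Clause 4", "Clause 5"]

for chunk in chunk_lines(sample_lines, chunk_size=2):
    print(repr(chunk))
```

Note that the last chunk may contain fewer than `chunk_size` lines when the number of non-blank lines is not a multiple of it.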


In [9]:
from datasets import Dataset

# Read the raw telecom text file (cleaned ITU standards)
with open("/kaggle/input/text-dataset/combined_text_output.txt", "r") as f:
    # Remove blank lines and strip whitespace
    lines = [line.strip() for line in f if line.strip()]

# Group every N lines into a single training sample (you can adjust chunk_size based on memory)
chunk_size = 10  # Try increasing to 20 or 50 depending on available RAM and model context length
chunks = [" ".join(lines[i:i + chunk_size]) for i in range(0, len(lines), chunk_size)]

# Create a Hugging Face-compatible dataset with a 'text' column
train_dataset = Dataset.from_dict({"text": chunks})


In [None]:
train_dataset["text"][58]

In [None]:
train_dataset["text"][882]

In [None]:
train_dataset["text"][226]

## 🔡 Tokenization of Text Data

This section initializes the tokenizer and tokenizes the telecom dataset for model training.

### Key Actions:
- **Load Tokenizer**: Loads the tokenizer from the base model (`mistralai/Mistral-7B-Instruct-v0.3`).
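  - `use_fast=False`: Uses the slower Python tokenizer instead of the Rust-based fast tokenizer.
  - `trust_remote_code=True`: Allows executing custom code shipped in the model repository. This is likely not needed for Mistral, whose tokenizer is supported natively by `transformers`.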
  - `add_eos_token=True`: Appends an EOS (end-of-sequence) token automatically.

- **Pad Token Handling**: Ensures the tokenizer has a padding token. If missing, it uses the EOS token as a surrogate pad token.

- **Tokenization Loop**: Iterates over the dataset and tokenizes each sample individually using the tokenizer.

This prepares the dataset in tokenized form, ready for continual pretraining using an autoregressive language modeling objective.
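
Because the tokenization loop does not truncate, long chunks produce long sequences, and these drive memory use during training. The sketch below summarizes sequence lengths for a list shaped like `tokenized_train_dataset` (a list of dicts with an `input_ids` key). The helper name `length_report`, the toy data, and the limit of 8 tokens are all illustrative assumptions.

```python
def length_report(tokenized_samples, max_length):
    """Summarize token counts and count samples longer than `max_length`."""
    lengths = [len(sample["input_ids"]) for sample in tokenized_samples]
    return {
        "num_samples": len(lengths),
        "max_tokens": max(lengths),
        "mean_tokens": sum(lengths) / len(lengths),
        "over_limit": sum(1 for n in lengths if n > max_length),
    }

# Toy stand-in for tokenized_train_dataset
toy_samples = [
    {"input_ids": [1, 5, 9, 2]},
    {"input_ids": [1, 7, 2]},
    {"input_ids": list(range(10))},
]

report = length_report(toy_samples, max_length=8)
print(report)
```

If many samples exceed the length the hardware can handle, reducing `chunk_size` or passing `truncation=True` with a `max_length` to the tokenizer are two possible remedies.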


In [10]:
# Load the tokenizer from the base model (Mistral-7B-Instruct)
tokenizer = AutoTokenizer.from_pretrained(
    base_model_id,
    use_fast=False,                 # Use the slow tokenizer (more flexible/customizable)
    trust_remote_code=True,        # Allow custom code from the model repo (likely not needed for Mistral)
    add_eos_token=True             # Automatically add EOS token at the end of sequences
)

# If the tokenizer doesn't have a pad token, use the EOS token instead
if tokenizer.pad_token is None:
    tokenizer.add_special_tokens({"pad_token": tokenizer.eos_token})

# Tokenize the entire dataset
tokenized_train_dataset = []

# Loop through each text chunk and tokenize it
for phrase in train_dataset:
    tokenized_train_dataset.append(tokenizer(phrase["text"]))



In [11]:
tokenized_train_dataset[56]

{'input_ids': [1, 9639, 29498, 19854, 29501, 5207, 29515, 1152, 8587, 29515, 19771, 7598, 3626, 16440, 19771, 3436, 3680, 8350, 29473, 29551, 29501, 29508, 29538, 14519, 1040, 2639, 5944, 1070, 1040, 9639, 29498, 19854, 29501, 5207, 29491, 4291, 29491, 9630, 29547, 29501, 29506, 1395, 29491, 29538, 29508, 29555, 29552, 1093, 29502, 29542, 29516, 29518, 29502, 29518, 29502, 29499, 29473, 29508, 29538, 8350, 29473, 29551, 29501, 29508, 29538, 1532, 9639, 29498, 19854, 29501, 5207, 2639, 5944, 9916, 3210, 6475, 15340, 6145, 29516, 15883, 1500, 9590, 3350, 10988, 2], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

## 🧪 Enable Gradient Checkpointing and Apply LoRA for PEFT

This section prepares the Mistral-7B model for **efficient continual pretraining** by:

### 🔄 1. Gradient Checkpointing
- **Purpose**: Reduces memory usage during training by trading compute for memory.
- **Effect**: Intermediate activations are recomputed during backpropagation instead of being stored.

### 🔧 2. LoRA (Low-Rank Adaptation)
Applies LoRA configuration via the `peft` library to enable **parameter-efficient training**:
- Only a small set of parameters are trained, while the rest of the model is frozen.
- Ideal for large models when full fine-tuning is too costly.

### 📌 LoRA Config Explanation:
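- `r=8`: Rank of the low-rank decomposition (a smaller rank means fewer trainable parameters, but lower capacity).
- `lora_alpha=64`: Scaling factor for LoRA updates. With the default scaling, the update is multiplied by `lora_alpha / r`, which is 8 here.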
- `target_modules`: Specifies which transformer submodules to modify (projection layers).
- `bias="none"`: Only LoRA adapters are trained, not biases.
- `lora_dropout=0.05`: Dropout to help generalize the low-rank updates.
- `task_type="CAUSAL_LM"`: Specifies that this is for a causal language modeling task (autoregressive generation).

This configuration ensures training efficiency without sacrificing much performance.
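
To get a feel for how small the trained portion is, the following sketch counts the LoRA parameters added by this configuration. Each adapted linear layer with input size `d_in` and output size `d_out` gains `r * (d_in + d_out)` parameters. The layer dimensions below are assumptions taken from the public Mistral-7B configuration (hidden size 4096, MLP size 14336, key/value projection size 1024, 32 layers); they are not read from the loaded model.

```python
# Assumed Mistral-7B dimensions (from the public model configuration)
HIDDEN = 4096
INTERMEDIATE = 14336
KV_DIM = 1024        # 8 key/value heads x 128 dimensions per head
NUM_LAYERS = 32

# (input size, output size) of each targeted linear layer
MODULE_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

def lora_param_count(r, shapes, num_layers):
    """Each adapted layer adds A (r x d_in) and B (d_out x r): r * (d_in + d_out) parameters."""
    per_layer = sum(r * (d_in + d_out) for d_in, d_out in shapes.values())
    return per_layer * num_layers

trainable = lora_param_count(r=8, shapes=MODULE_SHAPES, num_layers=NUM_LAYERS)
scaling = 64 / 8  # lora_alpha / r

print(f"Trainable LoRA parameters: {trainable:,}")
print(f"LoRA scaling factor: {scaling}")
```

On the loaded model, `model.print_trainable_parameters()` from `peft` reports the actual count and can be used to check this estimate.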


In [12]:
# Enable gradient checkpointing to reduce memory usage during training
model.gradient_checkpointing_enable()

# Prepare the model for 4-bit training using PEFT (required before applying LoRA)
model = prepare_model_for_kbit_training(model)

# Configure LoRA: low-rank adaptation setup for efficient fine-tuning
config = LoraConfig(
    r=8,  # Rank for low-rank decomposition (smaller rank = more efficient, but lower capacity)
    lora_alpha=64,  # Scaling factor applied to the LoRA updates
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],  # Target linear layers in transformer
    bias="none",  # Do not fine-tune biases
    lora_dropout=0.05,  # Dropout rate to apply within LoRA layers
    task_type="CAUSAL_LM"  # Task type: causal language modeling (next-token prediction)
)

# Apply the LoRA configuration to the model using PEFT
model = get_peft_model(model, config)


## 🚀 Launch Continual Pretraining with Transformers Trainer

This section sets up and launches the training process using Hugging Face’s `Trainer` API.

### 🧠 Model Training Setup:
- **Model**: LoRA-adapted Mistral-7B-Instruct with 4-bit quantization.
- **Trainer API**: Manages training loop, gradient accumulation, logging, and saving.

### ⚙️ TrainingArguments Highlights:
- `output_dir`: Directory to save checkpoints and final model.
- `per_device_train_batch_size=2`: Small batch size due to large model size and memory constraints.
- `gradient_accumulation_steps=2`: Accumulates gradients to simulate larger batch size.
- `num_train_epochs=10`: Number of epochs to train.
- `learning_rate=1e-4`: Learning rate for fine-tuning.
- `optim="paged_adamw_8bit"`: Memory-efficient optimizer from bitsandbytes.
- `bf16=False`: No bfloat16 used here (can be enabled if supported by hardware).
- `save_strategy="epoch"`: Save model at the end of each epoch.
- `save_steps=50`: Ignored here, because `save_strategy="epoch"` saves a checkpoint at the end of each epoch rather than every 50 steps.
- `logging_steps=50`: Log metrics every 50 steps.
- `logging_dir="./log"`: Directory to store training logs.

### 🧱 Data Collator:
- Uses `DataCollatorForLanguageModeling` with `mlm=False`, since this is **causal language modeling** (next-token prediction), not masked language modeling.

Lastly, `model.config.use_cache` is set to `False`: the key/value cache only speeds up generation and is incompatible with gradient checkpointing during training.
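
The batch-size settings above translate into an effective batch size and a number of optimizer steps as sketched below. The dataset size of 1000 samples is a hypothetical figure for illustration, and the exact step count reported by `Trainer` may differ slightly depending on how the last partial batch is rounded.

```python
import math

def effective_batch_size(per_device, accumulation_steps, num_devices=1):
    """Number of samples contributing to each optimizer update."""
    return per_device * accumulation_steps * num_devices

def optimizer_steps(num_samples, per_device, accumulation_steps, epochs, num_devices=1):
    """Approximate total optimizer updates over all epochs."""
    batch = effective_batch_size(per_device, accumulation_steps, num_devices)
    return math.ceil(num_samples / batch) * epochs

batch = effective_batch_size(per_device=2, accumulation_steps=2)
steps = optimizer_steps(num_samples=1000, per_device=2, accumulation_steps=2, epochs=10)

print(f"Effective batch size: {batch}")
print(f"Approximate optimizer steps for 1000 samples: {steps}")
```

With `logging_steps=50`, a run of this hypothetical size would log metrics about 50 times.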


In [13]:
# Initialize Hugging Face Trainer with model, dataset, training arguments, and data collator
trainer = transformers.Trainer(
    model=model,
    train_dataset=tokenized_train_dataset,

    # Define training hyperparameters and behaviors
    args=transformers.TrainingArguments(
        output_dir="./finetunedModel",           # Directory to save model checkpoints
        per_device_train_batch_size=2,           # Batch size per GPU
        gradient_accumulation_steps=2,           # Accumulate gradients to simulate larger batch size
        num_train_epochs=10,                     # Total number of training epochs
        learning_rate=1e-4,                      # Learning rate for optimizer
        # max_steps=2000,                        # Optional: train for fixed number of steps
        bf16=False,                              # Use bfloat16 if supported (False here)
        optim="paged_adamw_8bit",                # Use 8-bit AdamW optimizer from bitsandbytes
        logging_dir="./log",                     # Where to store training logs
        save_strategy="epoch",                   # Save model checkpoint at end of each epoch
        save_steps=50,                           # Ignored when save_strategy="epoch"
        logging_steps=50                         # Log training metrics every 50 steps
    ),

    # Causal language modeling requires next-token prediction (not MLM)
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False)
)

# Disable cache usage (necessary for gradient checkpointing to work properly)
model.config.use_cache = False

No label_names provided for model class `PeftModelForCausalLM`. Since `PeftModel` hides base models input arguments, if label_names is not given, label_names can't be set automatically within `Trainer`. Note that empty label_names list will be used instead.


In [15]:
# Start training
trainer.train()

## 🧪 Inference & Testing the Continually Pretrained Model

This section runs **inference** using the continually pretrained Mistral-7B-Instruct model to test its response on a telecom-specific query.

### 🧠 Prompt Construction:
- The input prompt is phrased to elicit a concise and accurate answer: `Just answer this question accurately and concisely.\nQuestion: {user_question}`


### ⚙️ Inference Process:
- The prompt is tokenized and moved to GPU.
- The model is set to evaluation mode (`model.eval()`).
- Inference is performed using `generate()` with `max_new_tokens=1024`, which caps the number of newly generated tokens at 1024.
- Output is decoded and printed without special tokens.
- GPU memory is cleared with `torch.cuda.empty_cache()` after each run.

> **Note**: This test function was executed **multiple times** using different telecom-related questions to evaluate how well the model responded after continual pretraining. The goal was to observe improvements in domain-specific reasoning and response fluency.

This evaluation approach helps validate the effectiveness of continual pretraining before moving on to instruction fine-tuning.
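
Since the same prompt is rebuilt for every question, the string handling can be factored into small helpers. This is a sketch only: `build_eval_prompt` and `strip_prompt` are hypothetical names, and `strip_prompt` assumes the decoded text begins with the prompt verbatim, which may not hold for every tokenizer.

```python
PROMPT_TEMPLATE = "Just answer this question accurately and concisely.\nQuestion: {question} "

def build_eval_prompt(question):
    """Fill the evaluation prompt template used in this notebook."""
    return PROMPT_TEMPLATE.format(question=question)

def strip_prompt(decoded_text, prompt):
    """Remove the echoed prompt from the decoded generation, if it is present."""
    prompt = prompt.strip()
    if decoded_text.startswith(prompt):
        return decoded_text[len(prompt):].strip()
    return decoded_text.strip()

prompt = build_eval_prompt("what is network virtualization?")

# Placeholder standing in for decoded model output (not a real generation)
decoded = prompt + "PLACEHOLDER ANSWER"
print(strip_prompt(decoded, prompt))
```

`generate()` returns the prompt tokens followed by the new tokens, so removing the echoed prompt makes the answers easier to compare across questions.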


In [None]:
user_question = "what is network virtualization?"

eval_prompt = f"Just answer this question accurately and concisely.\nQuestion: {user_question} "

promptTokenized = tokenizer(eval_prompt, return_tensors="pt").to("cuda")

model.eval()

with torch.no_grad():
  print(tokenizer.decode(model.generate(**promptTokenized, max_new_tokens=1024)[0], skip_special_tokens=True))
  torch.cuda.empty_cache()

In [None]:
user_question = "Explain Radio frequency shielding box and anechoic chamber?"

eval_prompt = f"Just answer this question accurately and concisely.\nQuestion: {user_question} "

promptTokenized = tokenizer(eval_prompt, return_tensors="pt").to("cuda")

model.eval()

with torch.no_grad():
  print(tokenizer.decode(model.generate(**promptTokenized, max_new_tokens=1024)[0], skip_special_tokens=True))
  torch.cuda.empty_cache()

In [None]:
user_question = "Explain Immersion and spray liquid cooling technology?"

eval_prompt = f"Just answer this question accurately and concisely.\nQuestion: {user_question} "

promptTokenized = tokenizer(eval_prompt, return_tensors="pt").to("cuda")

model.eval()

with torch.no_grad():
  print(tokenizer.decode(model.generate(**promptTokenized, max_new_tokens=1024)[0], skip_special_tokens=True))
  torch.cuda.empty_cache()

## ☁️ Merge LoRA Weights and Push Model to Hugging Face Hub

After continual pretraining, the **LoRA-adapted model** is merged back into the base Mistral-7B-Instruct weights for easier deployment and downstream fine-tuning.

### 🧩 What Happens Here:

- `model.merge_and_unload()`:
  - **Merges** the trained LoRA adapters into the base model weights.
  - **Unloads** the adapter-specific structure (making the model standard again).
  - The resulting model has the standard architecture with no adapter layers, so it loads like any other Mistral-7B checkpoint.

- `push_to_hub(new_model)`:
  - Uploads the final merged model and tokenizer to the Hugging Face Hub under the name:
    ```
    Cintinually-Pre-trained-Mistral-7B
    ```
    *(Note: Typo in “Continually” retained as per original code.)*

This version can now be **easily downloaded** later for further instruction fine-tuning using prompt-based datasets.

✅ This completes the continual pretraining phase.
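
Conceptually, merging replaces each adapted weight `W` with `W + (lora_alpha / r) * B @ A`, so the merged layer gives the same output as the base layer plus the adapter path. The toy example below checks this with plain Python lists. It is a sketch that ignores quantization: in this notebook the base weights are 4-bit, so the real merge may introduce small rounding differences. All matrix values are illustrative.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def merge_lora(w, lora_a, lora_b, alpha, r):
    """Return W + (alpha / r) * B @ A."""
    scale = alpha / r
    delta = matmul(lora_b, lora_a)
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]

# Toy 2x2 layer with a rank-1 adapter
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 2.0]]        # Shape (r, d_in) = (1, 2)
B = [[0.5], [0.25]]     # Shape (d_out, r) = (2, 1)
x = [[1.0], [1.0]]      # Input column vector
ALPHA, R = 2, 1

merged = merge_lora(W, A, B, ALPHA, R)
y_merged = matmul(merged, x)

# Adapter path: W x + (alpha / r) * B (A x)
base_out = matmul(W, x)
adapter_out = matmul(B, matmul(A, x))
y_adapter = [[b[0] + (ALPHA / R) * a[0]] for b, a in zip(base_out, adapter_out)]

print(merged)
print(y_merged == y_adapter)
```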


In [None]:
model = model.merge_and_unload()  # This fuses the learned LoRA weights into the base model

In [None]:
new_model = "Cintinually-Pre-trained-Mistral-7B"
model.push_to_hub(new_model) # Online saving
tokenizer.push_to_hub(new_model) # Online saving

In [None]:
# Reload the merged model from the Hub with the same 4-bit quantization settings
base_model_id = "Agaba-Embedded4/Cintinually-Pre-trained-Mistral-7B"

bnb_config = BitsAndBytesConfig(
    load_in_4bit = True,
    bnb_4bit_use_double_quant = True,
    bnb_4bit_quant_type = "nf4",
    bnb_4bit_compute_dtype = torch.bfloat16
)

model = AutoModelForCausalLM.from_pretrained(base_model_id, quantization_config = bnb_config)

In [None]:
# Reload the tokenizer that was pushed alongside the merged model
tokenizer = AutoTokenizer.from_pretrained(
    "Agaba-Embedded4/Cintinually-Pre-trained-Mistral-7B",
    use_fast=False,
    trust_remote_code=True,
    add_eos_token=True
)

In [None]:
user_question = "what is network slicing as it relates to 5G?"

eval_prompt = f"Just answer this question accurately and concisely.\nQuestion: {user_question} "

promptTokenized = tokenizer(eval_prompt, return_tensors="pt").to("cuda")

model.eval()

with torch.no_grad():
  print(tokenizer.decode(model.generate(**promptTokenized, max_new_tokens=1024)[0], skip_special_tokens=True))
  torch.cuda.empty_cache()