Here's how to build your own LLM in Google Colab by fine-tuning a pretrained model:

1.  **Set up the environment:**
    *   Open a new Google Colab notebook.
    *   Enable GPU acceleration: Navigate to "Runtime" > "Change runtime type" and select "GPU" under "Hardware accelerator."
    *   Install necessary libraries, such as `transformers`, `datasets`, and `accelerate`.

In [None]:
!pip install transformers datasets accelerate

Collecting datasets
  Downloading datasets-3.5.1-py3-none-any.whl.metadata (19 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess<0.70.17 (from datasets)
  Downloading multiprocess-0.70.16-py311-none-any.whl.metadata (7.2 kB)
Collecting fsspec<=2025.3.0,>=2023.1.0 (from fsspec[http]<=2025.3.0,>=2023.1.0->datasets)
  Downloading fsspec-2025.3.0-py3-none-any.whl.metadata (11 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.4.127 (from torch>=2.0.0->accelerate)
  Downloading nvidia_cuda_nvrtc_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.4.127 (from torch>=2.0.0->accelerate)
  Downloading nvidia_cuda_runtime_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-cupti-cu12==12.4.127 (from
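
Before moving on, it can help to confirm that the GPU runtime selected in step 1 is actually visible to PyTorch. The following is a minimal sketch; the helper name `describe_accelerator` is made up for illustration, and the import guard is only there so the cell also runs where `torch` is missing.

```python
import importlib.util


def describe_accelerator() -> str:
    """Return a short description of the device that PyTorch would use."""
    # torch is preinstalled on Colab; the guard only keeps this cell
    # runnable in environments where it is not installed.
    if importlib.util.find_spec("torch") is None:
        return "PyTorch is not installed"
    import torch

    if torch.cuda.is_available():
        return f"GPU available: {torch.cuda.get_device_name(0)}"
    return "No GPU detected; training will run on the CPU"


print(describe_accelerator())
```

If this reports that no GPU was detected, revisit the runtime type setting before training.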

2.  **Load and preprocess your dataset:**
    *   Choose a dataset suitable for language modeling. You can use datasets from the Hugging Face Hub or upload your own.
    *   Load the dataset using the `datasets` library.
    *   Tokenize the text data using a suitable tokenizer (e.g., from the `transformers` library).
    *   Group the tokenized sequences into chunks of a fixed size.
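
The grouping step in the last bullet can be illustrated in plain Python, independent of any library. This is a small sketch with made-up token IDs; the function name `chunk_tokens` is hypothetical, and the block size of 4 stands in for the 128 used in this notebook.

```python
def chunk_tokens(token_ids, block_size):
    """Split a flat list of token IDs into full blocks of block_size tokens."""
    # Round the length down to a multiple of block_size, dropping the partial tail.
    usable = (len(token_ids) // block_size) * block_size
    return [token_ids[i : i + block_size] for i in range(0, usable, block_size)]


# Ten made-up token IDs and a block size of 4.
blocks = chunk_tokens(list(range(10)), block_size=4)
print(blocks)  # [[0, 1, 2, 3], [4, 5, 6, 7]]
```

The last two IDs are discarded because they do not fill a block. A text shorter than one block therefore produces no training examples at all.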

In [None]:
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForCausalLM, TrainingArguments, Trainer
import torch

# Load dataset
dataset = load_dataset("text", data_files="text_file.txt")

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

block_size = 128  # number of tokens per training example

def tokenize_function(examples):
    # No padding or truncation here: the texts are concatenated and split
    # into fixed-size blocks below, so padding tokens would only end up
    # in the training data.
    return tokenizer(examples["text"])

# Drop the raw "text" column so that only token-level columns remain.
tokenized_datasets = dataset.map(tokenize_function, batched=True, remove_columns=["text"])

def group_texts(examples):
    # Concatenate every column (input_ids, attention_mask) across the batch.
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated_examples["input_ids"])
    # Drop the final partial block so every example has exactly block_size tokens.
    total_length = (total_length // block_size) * block_size
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    # For causal language modeling the labels are the inputs; the model shifts them internally.
    result["labels"] = result["input_ids"].copy()
    return result

lm_datasets = tokenized_datasets.map(group_texts, batched=True)

model = AutoModelForCausalLM.from_pretrained("gpt2")
training_args = TrainingArguments(
    output_dir="./results",
    per_device_train_batch_size=4,
    num_train_epochs=1,
    learning_rate=2e-5,
    save_steps=1000,
    logging_steps=100,
)


from transformers import default_data_collator

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=lm_datasets["train"],
    # Every example already has the same length, so the default collator
    # only needs to stack input_ids, attention_mask, and labels into tensors.
    data_collator=default_data_collator,
)

# Check if train_dataset is empty before training
if len(trainer.train_dataset) > 0:
    trainer.train()
    trainer.save_model("./my_llm_model")
else:
    print("Error: The training dataset is empty. Please check your data and preprocessing steps.")

Map:   0%|          | 0/1 [00:00<?, ? examples/s]



`loss_type=None` was set in the config but it is unrecognised.Using the default loss: `ForCausalLMLoss`.


Step,Training Loss
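
The loss table above has no rows. One likely explanation, assuming a small training file, is that the run finished before reaching the first logging step (`logging_steps=100`). The sketch below estimates the number of optimizer steps for a single-device run without gradient accumulation; the function name `optimizer_steps` and the example count of 200 are illustrative assumptions, not values taken from this run.

```python
import math


def optimizer_steps(num_examples, batch_size, epochs):
    """Estimate optimizer steps for one device with no gradient accumulation."""
    # The last batch of each epoch may be smaller, hence the ceiling.
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * epochs


# Hypothetical dataset of 200 blocks with the batch size and epochs used above.
steps = optimizer_steps(num_examples=200, batch_size=4, epochs=1)
print(steps)         # 50
print(steps >= 100)  # False: training ends before the first logged step
```

For small datasets, lowering `logging_steps` and `save_steps` in `TrainingArguments` should make the training loss and checkpoints appear.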


In [None]:
from huggingface_hub import login

# Never hard-code an access token in a notebook. Calling login() without
# arguments prompts for the token interactively.
login()

In [None]:
# Save under the same path that the inference cell below loads from.
trainer.save_model("./my_llm_model")


In [None]:
from transformers import pipeline, AutoTokenizer, AutoModelForCausalLM

# Assuming you saved your fine-tuned model in the current directory under "my_llm_model"
model = AutoModelForCausalLM.from_pretrained("./my_llm_model", device_map="auto")
# Load the tokenizer that was used during training - "gpt2" in this case
tokenizer = AutoTokenizer.from_pretrained("gpt2")

pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)

prompt = """
What is a dog?

Response:
"""

result = pipe(
    prompt,
    max_new_tokens=300,
    do_sample=True,
    temperature=0.7,
    top_p=0.95,
    repetition_penalty=1.2,
    eos_token_id=tokenizer.eos_token_id,
)[0]['generated_text']

print(result)

Device set to use cuda:0
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.



What is a dog?

Response:
... The English word for an animal, to be or as it were; from the Latin ininum ("to take") means "be". A cat has three sexes of male and female—the dominant one being called mare by its Spanish parents (Merella e una) while the other two are named after her own species when they first came into contact with humans at birth.[12] Dogs can also have their name changed if that change does not occur naturally within them,[13][14]. One would assume such changes will eventually happen but this theory assumes otherwise since no natural process occurs before any human-specific events happened so there must still exist some sort [15], though perhaps more likely these things may never actually make themselves known even now because nobody knows what happens next on earth once all life forms had been created anew[16]."Dogs" meaning dogs was coined during World War II which saw many American military officers become familiarised directly upon hearing about European German
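
The generation call above samples with `temperature=0.7` and `top_p=0.95`. The following is a simplified sketch of how these two settings narrow the set of candidate next tokens. It is not the `transformers` implementation, it ignores `repetition_penalty`, and the function name `top_p_candidates` and the logits are made up for illustration.

```python
import math


def top_p_candidates(logits, temperature, top_p):
    """Return {token: probability} for the nucleus that sampling would draw from."""
    # Temperature scaling followed by a numerically stable softmax.
    scaled = {tok: value / temperature for tok, value in logits.items()}
    peak = max(scaled.values())
    exps = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    total = sum(exps.values())
    probs = {tok: e / total for tok, e in exps.items()}

    # Keep the most probable tokens until their cumulative mass reaches top_p.
    kept, cumulative = {}, 0.0
    for tok, p in sorted(probs.items(), key=lambda item: item[1], reverse=True):
        kept[tok] = p
        cumulative += p
        if cumulative >= top_p:
            break

    # Renormalize so the kept probabilities sum to 1.
    mass = sum(kept.values())
    return {tok: p / mass for tok, p in kept.items()}


# Made-up logits for four candidate next tokens.
logits = {"dog": 4.0, "cat": 3.5, "animal": 1.0, "zebra": -3.0}
candidates = top_p_candidates(logits, temperature=0.7, top_p=0.95)
print(candidates)  # only "dog" and "cat" remain
```

A temperature below 1 sharpens the distribution toward the top tokens, and the `top_p` cutoff then discards the low-probability tail before a token is drawn.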