🧩 Step 1: Install dependencies

In [38]:
!pip install transformers datasets torch accelerate

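If the install cell is interrupted or cancelled part-way, some packages may be missing. A minimal sketch for checking what is actually installed, using only the standard library (the helper name `installed_versions` is made up for this example):

```python
from importlib.metadata import PackageNotFoundError, version

# The four packages requested by the pip command in this step.
REQUIRED = ["transformers", "datasets", "torch", "accelerate"]

def installed_versions(packages):
    """Map each package name to its installed version, or None if missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found

for name, ver in installed_versions(REQUIRED).items():
    print(f"{name}: {ver or 'NOT INSTALLED'}")
```

If any line reports NOT INSTALLED, re-run the install cell and let it finish before moving on.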

🧩 Step 2: Import Libraries

In [39]:
from transformers import BertTokenizer, BertForMaskedLM, Trainer, TrainingArguments, DataCollatorForLanguageModeling
from datasets import load_dataset
import torch

🧩 Step 3: Prepare the Dataset

If you have your corpus (say harry_potter_corpus.txt), upload it to Colab first.

In [None]:
from google.colab import files
uploaded = files.upload()  # upload harry_potter_corpus.txt

Then create a Hugging Face dataset from it. The "text" loader treats each line of the file as one training example:

In [None]:
dataset = load_dataset("text", data_files={"train": "harry_potter_corpus.txt"})
print(dataset)
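Because each line of the file becomes one example, blank lines in the corpus typically end up as empty examples. Here is a rough pure-Python sketch of the one-row-per-line layout, with blank lines dropped. It is an illustration of the idea, not the library's implementation, and `lines_to_records` and the sample text are placeholders:

```python
def lines_to_records(raw_text):
    """Mimic a one-row-per-line dataset, skipping blank lines."""
    records = []
    for line in raw_text.splitlines():
        line = line.strip()
        if line:  # drop empty and whitespace-only lines
            records.append({"text": line})
    return records

# Placeholder corpus text, for illustration only.
sample = "First line of the corpus.\n\nSecond line.\n   \nThird line.\n"
records = lines_to_records(sample)
print(len(records))
print(records[1])
```

Cleaning the file this way before uploading it is one option for keeping empty rows out of the training set.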

🧩 Step 4: Tokenize the text

We'll use BERT's WordPiece tokenizer to split the text into sub-word tokens and map them to integer IDs. Each example is truncated or padded to 128 tokens.
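To give a feel for how sub-word splitting works, here is a minimal sketch of greedy longest-match-first splitting in the WordPiece style. The toy vocabulary is made up for this example, so the real `bert-base-uncased` vocabulary may split these words differently:

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Split one lowercase word into sub-word pieces, longest match first."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it is in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces

# Hypothetical toy vocabulary, not the real BERT vocabulary.
toy_vocab = {"hog", "##wart", "##s", "wand", "the"}
print(wordpiece("hogwarts", toy_vocab))
print(wordpiece("wand", toy_vocab))
print(wordpiece("quidditch", toy_vocab))
```

Rare or invented words get broken into smaller known pieces instead of being discarded, which is why the pretrained tokenizer can still handle a fantasy corpus.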

In [None]:
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

def tokenize_function(examples):
    return tokenizer(examples["text"], truncation=True, padding="max_length", max_length=128)

tokenized_datasets = dataset.map(tokenize_function, batched=True, remove_columns=["text"])
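The `truncation=True, padding="max_length", max_length=128` settings give every example the same length. This simplified sketch shows what that does to a list of token IDs. It assumes the special-token IDs of `bert-base-uncased` ([CLS]=101, [SEP]=102, [PAD]=0), and `encode_fixed_length` is a name invented here:

```python
# Special-token ids, assumed to match the bert-base-uncased vocabulary.
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0

def encode_fixed_length(token_ids, max_length=128):
    """Add [CLS]/[SEP], then truncate or pad to exactly max_length ids."""
    body = token_ids[: max_length - 2]  # leave room for the two special tokens
    input_ids = [CLS_ID] + body + [SEP_ID]
    attention_mask = [1] * len(input_ids)  # 1 = real token
    padding = max_length - len(input_ids)
    return {
        "input_ids": input_ids + [PAD_ID] * padding,
        "attention_mask": attention_mask + [0] * padding,  # 0 = padding
    }

short = encode_fixed_length([7, 8, 9], max_length=8)
print(short["input_ids"])
print(short["attention_mask"])

long = encode_fixed_length(list(range(1000, 1020)), max_length=8)
print(long["input_ids"])
```

The attention mask tells the model which positions are padding, so the padded positions do not influence the real tokens.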

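🧩 Step 5: Create Data Collator

The data collator masks random tokens on the fly as each batch is assembled, so the model sees a different masking pattern on every pass over the data. This provides the training signal for the masked language modeling (MLM) task.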

In [None]:
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=True,
    mlm_probability=0.15
)
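For intuition, here is a pure-Python sketch of the standard BERT masking recipe. Each non-special token is selected with probability `mlm_probability`. Of the selected tokens, about 80% become [MASK], 10% become a random token, and 10% stay unchanged. The constants assume the `bert-base-uncased` vocabulary, and `mask_tokens` is an illustrative helper, not the collator's actual code:

```python
import random

MASK_ID = 103              # [MASK] in bert-base-uncased (assumed)
VOCAB_SIZE = 30522         # vocabulary size of bert-base-uncased (assumed)
SPECIAL_IDS = {0, 101, 102}  # [PAD], [CLS], [SEP] are never masked
IGNORE = -100              # label value the loss function skips

def mask_tokens(input_ids, mlm_probability=0.15, rng=None):
    """Return (masked_inputs, labels) following the 80/10/10 rule."""
    rng = rng or random.Random()
    inputs = list(input_ids)
    labels = [IGNORE] * len(inputs)
    for i, tok in enumerate(input_ids):
        if tok in SPECIAL_IDS or rng.random() >= mlm_probability:
            continue  # position not selected for prediction
        labels[i] = tok  # the model must recover the original token here
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = MASK_ID
        elif roll < 0.9:
            inputs[i] = rng.randrange(VOCAB_SIZE)
        # otherwise: leave the original token in place
    return inputs, labels

ids = [101, 2000, 2001, 2002, 2003, 2004, 102, 0, 0]
masked, labels = mask_tokens(ids, rng=random.Random(0))
print(masked)
print(labels)
```

Only positions whose label is not -100 contribute to the loss, which is why the unselected tokens cost nothing during training.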

🧩 Step 6: Load Pretrained BERT

In [None]:
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

🧩 Step 7: Define Training Arguments

We'll fine-tune for two epochs to keep the run light enough for Colab.

In [None]:
training_args = TrainingArguments(
    output_dir="./bert-harrypotter",
    overwrite_output_dir=True,
    num_train_epochs=2,
    per_device_train_batch_size=8,
    save_steps=500,
    save_total_limit=2,
    prediction_loss_only=True,
    logging_steps=100
)
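To estimate how long training will run and which checkpoints will remain on disk, the arithmetic can be sketched as below. It assumes a single device, no gradient accumulation, and that the last partial batch is kept. The corpus size of 10,000 examples is a made-up figure:

```python
import math

def training_schedule(num_examples, batch_size=8, epochs=2,
                      save_steps=500, save_total_limit=2):
    """Return (total_steps, checkpoint_steps, checkpoints_kept)."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    total_steps = steps_per_epoch * epochs
    # A checkpoint is written every save_steps optimizer steps.
    checkpoints = list(range(save_steps, total_steps + 1, save_steps))
    # Only the most recent save_total_limit checkpoints are kept.
    kept = checkpoints[-save_total_limit:]
    return total_steps, checkpoints, kept

# Hypothetical corpus of 10,000 lines.
total, checkpoints, kept = training_schedule(num_examples=10_000)
print(total)  # 2500
print(kept)   # [2000, 2500]
```

With `logging_steps=100`, the same run would report the training loss 25 times.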

🧩 Step 8: Create Trainer

In [None]:
trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=tokenized_datasets["train"]
)

🧩 Step 9: Fine-tune the model

In [None]:
trainer.train()

This step will:

Randomly select 15% of tokens for prediction, replacing most of them with [MASK] (for example, “magic” → “[MASK]”)

Train BERT to predict the original tokens at those positions

Adapt the model to the wording and style of the Harry Potter corpus
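The loss that the trainer logs is the average cross-entropy over the selected positions only. Here is a small sketch of that calculation, with the conversion to perplexity (the exponential of the loss). The probabilities and label IDs are invented for illustration:

```python
import math

IGNORE = -100  # label for positions that were not selected for prediction

def masked_lm_loss(correct_token_probs, labels):
    """Average negative log-likelihood over positions with a real label."""
    losses = [
        -math.log(p)
        for p, label in zip(correct_token_probs, labels)
        if label != IGNORE
    ]
    return sum(losses) / len(losses)

# Made-up probabilities the model assigns to the correct token at each position.
probs = [0.50, 0.90, 0.25, 0.99]
labels = [IGNORE, 2158, 3894, IGNORE]  # only positions 1 and 2 count

loss = masked_lm_loss(probs, labels)
print(f"loss={loss:.4f}  perplexity={math.exp(loss):.2f}")
```

A falling loss (and therefore falling perplexity) over the logged steps is the usual sign that the model is adapting to the corpus.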

🧩 Step 10: Save your fine-tuned model

In [None]:
trainer.save_model("./bert-harrypotter-finetuned")
tokenizer.save_pretrained("./bert-harrypotter-finetuned")

🧪 3. Use the Fine-tuned Model

Now you can load it again and use it for masked word predictions.

In [None]:
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="./bert-harrypotter-finetuned", tokenizer="./bert-harrypotter-finetuned")

prompt = "Harry looked at Ron and said it was a [MASK] day at Hogwarts."
for pred in fill_mask(prompt):
    print(f"{pred['token_str']}: {pred['score']:.4f}")

Example output (illustrative; your scores will differ):

magical: 0.4231

beautiful: 0.2122

strange: 0.1048

cold: 0.0873

wonderful: 0.0657


Now your model has learned the Harry Potter tone! 🪄
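One way to check that claim is to run the same prompt through the base model and compare scores token by token. The sketch below works on lists shaped like the fill-mask pipeline's output (dicts with `token_str` and `score`). The numbers are hypothetical, and `compare_predictions` is a helper invented here:

```python
def compare_predictions(base_preds, tuned_preds):
    """Pair each fine-tuned prediction with the base model's score for the same token."""
    base_scores = {p["token_str"]: p["score"] for p in base_preds}
    return [
        (p["token_str"], base_scores.get(p["token_str"]), p["score"])
        for p in tuned_preds
    ]

# Hypothetical scores, for illustration only (not measured results).
base_preds = [
    {"token_str": "beautiful", "score": 0.30},
    {"token_str": "cold", "score": 0.12},
]
tuned_preds = [
    {"token_str": "magical", "score": 0.42},
    {"token_str": "beautiful", "score": 0.21},
]

for token, before, after in compare_predictions(base_preds, tuned_preds):
    before_text = "not in base top-k" if before is None else f"{before:.2f}"
    print(f"{token}: base={before_text}, fine-tuned={after:.2f}")
```

To get real inputs, call a fill-mask pipeline built from "bert-base-uncased" and the fine-tuned `fill_mask` pipeline on the same prompt, then pass both results to the helper.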

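🧩 4. Generate Harry Potter-style text

BERT is a masked language model, not a left-to-right text generator, so for free-form completions you can pair it with a small GPT-2 generator. Note that the stock gpt2 checkpoint used below has not been fine-tuned on the corpus.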

In [None]:
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
prompt = "At Hogwarts, Hermione discovered a hidden chamber where"
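# temperature only takes effect when sampling is enabled; max_length counts the prompt tokens too
print(generator(prompt, max_length=40, do_sample=True, temperature=0.8)[0]['generated_text'])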
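The `temperature` argument rescales the model's scores before a token is sampled. Values below 1 sharpen the distribution (safer, more repetitive text), and values above 1 flatten it. Here is a minimal sketch of that effect on made-up logits:

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw scores into probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up scores for three candidate next tokens.
logits = [2.0, 1.0, 0.1]
for t in (1.0, 0.8, 0.3):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

At 0.8, as used above, the most likely token gets somewhat more probability mass than at 1.0 while the others still have a chance of being picked.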

🧩 5. Explanation to Teach

| Concept                  | What Students Learn                              |
| ------------------------ | ------------------------------------------------ |
| Pretrained Model         | BERT already knows general English               |
| Fine-Tuning              | We adapt it to a new domain (Harry Potter)       |
| Masked Language Modeling | Predict missing (masked) words from context      |
| Tokenization             | Converts text → token IDs                        |
| Data Collator            | Randomly masks tokens for training               |
| Trainer API              | Handles training loops, logging, and checkpoints |
| Output                   | Model now “talks” like the Harry Potter universe |



🪄 Recap Workflow

Upload Corpus → Harry Potter books or fan dataset

Tokenize Text → Convert to BERT-friendly format

Fine-Tune → Train BERT for a few epochs

Save Model → bert-harrypotter-finetuned

Use It! → Masked word prediction, or as a starting point for further fine-tuning on tasks such as sentiment analysis