First, a few key points.
1. We will use Hugging Face's Transformers library, along with its Trainer and Datasets APIs.
2. Training uses CUDA 12.4, which is compatible with the current PyTorch version; refer to the PyTorch docs for a table of compatible CUDA and PyTorch versions. It isn't necessary to import 'torch' and define a 'device' variable: the Hugging Face Trainer moves the model and each batch to the GPU for us. (Outside the Trainer, for example when calling generate() directly, the model stays on the CPU unless we move it ourselves.)
3. The training data is tiny, just over 1,000 samples, and far from being a real corpus. Still, this project is for learning purposes.

In [57]:
from transformers import GPT2Tokenizer, GPT2LMHeadModel # Import GPT-2's LM head and tokenizer.

In [None]:
model_ckpt = "gpt2"
gpt2_tokenizer = GPT2Tokenizer.from_pretrained(model_ckpt)
gpt2_tokenizer.pad_token = gpt2_tokenizer.eos_token # GPT-2 has no pad token, so we reuse EOS as the 'pad_token'.
gpt2_lmheadmodel = GPT2LMHeadModel.from_pretrained(model_ckpt)

# Set the model to evaluation mode.
gpt2_lmheadmodel.eval()

GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-11): 12 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D(nf=2304, nx=768)
          (c_proj): Conv1D(nf=768, nx=768)
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D(nf=3072, nx=768)
          (c_proj): Conv1D(nf=768, nx=3072)
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=768, out_features=50257, bias=False)
)

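Here, we can see GPT-2's structure: 12 transformer blocks with a hidden size of 768, a vocabulary of 50,257 tokens, and 1,024 position embeddings.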
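
As a quick sanity check, the parameter count can be tallied by hand from the layer shapes printed above. This is a minimal sketch that assumes every Conv1D and LayerNorm layer carries a bias term and that lm_head shares its weight matrix with wte (so it adds no parameters of its own). The helper names are made up for illustration.

```python
# Tally GPT-2's parameters from the shapes in the printed module tree.
# Assumptions: Conv1D and LayerNorm layers have biases, and lm_head is
# tied to wte, so it contributes no extra parameters.

VOCAB, POSITIONS, HIDDEN, LAYERS = 50257, 1024, 768, 12


def conv1d(nx, nf):
    """Weight matrix (nx x nf) plus a bias vector of size nf."""
    return nx * nf + nf


def layer_norm(size):
    """One scale and one shift value per feature."""
    return 2 * size


embeddings = VOCAB * HIDDEN + POSITIONS * HIDDEN  # wte + wpe

per_block = (
    layer_norm(HIDDEN)               # ln_1
    + conv1d(HIDDEN, 3 * HIDDEN)     # attn.c_attn (nf=2304)
    + conv1d(HIDDEN, HIDDEN)         # attn.c_proj
    + layer_norm(HIDDEN)             # ln_2
    + conv1d(HIDDEN, 4 * HIDDEN)     # mlp.c_fc (nf=3072)
    + conv1d(4 * HIDDEN, HIDDEN)     # mlp.c_proj
)

total = embeddings + LAYERS * per_block + layer_norm(HIDDEN)  # + ln_f
print(f"{total:,} parameters")
```

Under these assumptions the tally comes to 124,439,808, i.e. roughly 124 million parameters, most of them in the 12 blocks and the token embedding.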

In [45]:
# Generating text from a prompt

prompt = "World War I was "
input_ids = gpt2_tokenizer.encode(prompt, return_tensors="pt")
gen_text = gpt2_lmheadmodel.generate(
    input_ids,
    max_length=128,
    num_return_sequences=1,
    no_repeat_ngram_size=2,
    do_sample=True,
    top_k=50,
    top_p=0.95,
    temperature=0.7,
    pad_token_id=gpt2_tokenizer.eos_token_id
)

Before we begin fine-tuning, let's see how base GPT-2 performs. The call above passes several parameters that control how the output is decoded. Here's what each one does:

max_length=128 => Sets the maximum length of the generated sequence in tokens, including the input prompt.

num_return_sequences=1 => Specifies the number of sequences to generate. Here, we generate only one output.

no_repeat_ngram_size=2 => Prevents any sequence of 2 consecutive tokens (a 2-gram) from appearing more than once, reducing repetitive text.

do_sample=True => Enables sampling instead of greedy decoding, making the output more diverse.

top_k=50 => Uses top-k sampling, where only the 50 most probable tokens are considered at each step.

top_p=0.95 => Uses nucleus (top-p) sampling, where only the smallest set of most probable tokens whose cumulative probability reaches 95% is considered. The size of that set changes from step to step.

temperature=0.7 => Scales the logits before sampling. Lower values (e.g., 0.5) make the model more conservative, while higher values (e.g., 1.2) make it more random.

pad_token_id=gpt2_tokenizer.eos_token_id => GPT-2 has no dedicated padding token, so we reuse the end-of-sequence (EOS) token for padding.
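
To make temperature, top-k, and top-p more concrete, here is a small pure-Python sketch of how the three could interact on a single decoding step. It is a simplified illustration, not the library's implementation: the candidate tokens and logits are invented, and the function names are hypothetical.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Turn logits into probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def filter_top_k_top_p(probs, top_k, top_p):
    """Keep the top_k most probable tokens, then the smallest prefix of
    them whose renormalized cumulative probability reaches top_p."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    mass = sum(probs[i] for i in ranked)
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i] / mass
        if cumulative >= top_p:
            break
    kept_mass = sum(probs[i] for i in kept)
    return {i: probs[i] / kept_mass for i in kept}


# Invented candidates and logits for the token after "World War I was".
tokens = [" a", " the", " fought", " over", " banana"]
logits = [2.0, 1.0, 0.5, 0.0, -1.0]

probs = softmax(logits, temperature=0.7)
candidates = filter_top_k_top_p(probs, top_k=3, top_p=0.9)

# Sample one token from what survived the filters.
rng = random.Random(0)
choice = rng.choices(list(candidates), weights=list(candidates.values()))[0]
print({tokens[i]: round(p, 3) for i, p in candidates.items()})
print("sampled:", tokens[choice])
```

A lower temperature sharpens the distribution before filtering, so fewer tokens are needed to reach the top-p threshold. That is why the nucleus often ends up much smaller than top_k.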

In [58]:
gen_text_decoded = gpt2_tokenizer.decode(
    gen_text[0], skip_special_tokens=True
)
print(f'GPT-2 says: \n{gen_text_decoded}')

GPT-2 says: 
World War I was  an era when the military and industrial powers in the Middle East were in power, and those forces were using nuclear weapons to control populations.
In the 1950s, when I began my research on nuclear energy, I realized that the most effective way to use nuclear power was to build a nuclear reactor. I quickly learned that there was no nuclear fuel, that nuclear reactors were not used to produce energy in a vacuum, but to create a mass of waste that would then be used for nuclear fuels. It took me many years to develop a plan for a new type of nuclear waste storage facility in New York City,


The generated text is, to say the least, inaccurate: it drifts from World War I to nuclear reactors within a couple of sentences. One thing we can infer, though, is that its grammatical and morphological structure is sound.

In [7]:
# To improve GPT-2's performance, we will extract data from Wikipedia, using web_scrape_utility.py

from web_scrape_utility import extract_text_from_wiki

titles = [
    "World War I",
    "Battle of Verdun",
    "Schlieffen Plan",
    "Trench warfare",
    "Western Front (World War I)",
    "Eastern Front (World War I)",
    "Treaty of Versailles",
    "Battle of the Somme",
    "Armistice of 11 November 1918",
    "Gallipoli campaign",
    "Battle of Jutland",
    "Zimmermann Telegram",
    "Russian Revolution",
    "Central Powers",
    "Allied Powers (World War I)"
]

for title in titles:
    extract_text_from_wiki(title=title)

Wikipedia article was extracted from 'World War I' and saved to 'World War I.txt'
Wikipedia article was extracted from 'Battle of Verdun' and saved to 'Battle of Verdun.txt'
Wikipedia article was extracted from 'Schlieffen Plan' and saved to 'Schlieffen Plan.txt'
Wikipedia article was extracted from 'Trench warfare' and saved to 'Trench warfare.txt'
Wikipedia article was extracted from 'Western Front (World War I)' and saved to 'Western Front (World War I).txt'
Wikipedia article was extracted from 'Eastern Front (World War I)' and saved to 'Eastern Front (World War I).txt'
Wikipedia article was extracted from 'Treaty of Versailles' and saved to 'Treaty of Versailles.txt'
Wikipedia article was extracted from 'Battle of the Somme' and saved to 'Battle of the Somme.txt'
Wikipedia article was extracted from 'Armistice of 11 November 1918' and saved to 'Armistice of 11 November 1918.txt'
Wikipedia article was extracted from 'Gallipoli campaign' and saved to 'Gallipoli campaign.txt'
Wikipedi

So, let's begin with the first step: extracting text from Wikipedia articles. Wikipedia provides numerous articles that can be scraped easily; here, we use extract_text_from_wiki() from web_scrape_utility.py.

In [8]:
print(f'Extracted text from {len(titles)} articles from Wikipedia.')

Extracted text from 15 articles from Wikipedia.


In [22]:
import os

# List of titles from the earlier cell
titles = [
    "World War I",
    "Battle of Verdun",
    "Schlieffen Plan",
    "Trench warfare",
    "Western Front (World War I)",
    "Eastern Front (World War I)",
    "Treaty of Versailles",
    "Battle of the Somme",
    "Armistice of 11 November 1918",
    "Gallipoli campaign",
    "Battle of Jutland",
    "Zimmermann Telegram",
    "Russian Revolution",
    "Central Powers",
    "Allied Powers (World War I)"
]

# Output file for the corpus
corpus_file = "ww1_corpus.txt"

try:
    # Open the corpus file in write mode
    with open(corpus_file, "w", encoding="utf-8") as output:
        # Iterate through each title
        for title in titles:
            input_file = f"{title}.txt"
            if os.path.exists(input_file):
                # Read the content of the individual file
                with open(input_file, "r", encoding="utf-8") as input_txt:
                    content = input_txt.read()
                    # Write the title as a header (optional) and the content
                    output.write(f"\n\n=== {title} ===\n\n")  # Separator for clarity
                    output.write(content)
                print(f"Appended {input_file} to {corpus_file}")
            else:
                print(f"Warning: {input_file} not found, skipping...")

    print(f"Corpus successfully created in {corpus_file}")

except IOError as e:
    print(f"Error creating corpus: {e}")

Appended World War I.txt to ww1_corpus.txt
Appended Battle of Verdun.txt to ww1_corpus.txt
Appended Schlieffen Plan.txt to ww1_corpus.txt
Appended Trench warfare.txt to ww1_corpus.txt
Appended Western Front (World War I).txt to ww1_corpus.txt
Appended Eastern Front (World War I).txt to ww1_corpus.txt
Appended Treaty of Versailles.txt to ww1_corpus.txt
Appended Battle of the Somme.txt to ww1_corpus.txt
Appended Armistice of 11 November 1918.txt to ww1_corpus.txt
Appended Gallipoli campaign.txt to ww1_corpus.txt
Appended Battle of Jutland.txt to ww1_corpus.txt
Appended Zimmermann Telegram.txt to ww1_corpus.txt
Appended Russian Revolution.txt to ww1_corpus.txt
Appended Central Powers.txt to ww1_corpus.txt
Appended Allied Powers (World War I).txt to ww1_corpus.txt
Corpus successfully created in ww1_corpus.txt


So, we compiled all the .txt files into a single 'corpus' (which, I admit, is no corpus by any means) so that we can feed it to GPT-2.

In [None]:
from datasets import load_dataset


corpus_file = "ww1_corpus.txt"

# Load the text file as a dataset
dataset = load_dataset(
    "text",  
    data_files=corpus_file,
    split="train"  # Load as a single "train" split
)


print(f"Dataset loaded with {len(dataset)} examples")
print("First example:")
print(dataset[0])
corpus_text = dataset["text"]  # List of lines from the file
print(f"Number of lines in corpus: {len(corpus_text)}")
print("Sample line:")
print(corpus_text[0])

Generating train split: 0 examples [00:00, ? examples/s]

Dataset loaded with 2792 examples
First example:
{'text': ''}
Number of lines in corpus: 2792
Sample line:



We will be using a number of classes and functions, generously provided by Hugging Face:
1. load_dataset(), to create a dataset. It supports .txt files!
2. GPT2Tokenizer and GPT2LMHeadModel, which we've imported already.
3. TrainingArguments to define, well, training arguments.
4. Trainer, which carries out the fine-tuning/training process.
5. The DataCollatorForLanguageModeling class, which batches and pads the tokenized samples.

Because we passed split="train", load_dataset returns a Dataset object (without the split argument it would return a DatasetDict), which has some pretty useful attributes and methods. But above, we see a problem: some examples in the dataset are just empty. So, we need to deal with it by removing the empty samples.
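
Where do the empty examples come from? The corpus file was written with blank lines around each title and between paragraphs, and the "text" loader produces one example per line. The sketch below mimics that with plain string splitting on an invented miniature corpus; it approximates what the loader does and is not its actual code.

```python
# A miniature corpus written the same way as ww1_corpus.txt: each article
# starts with "\n\n=== title ===\n\n" and paragraphs are separated by blank lines.
mini_corpus = "\n\n=== World War I ===\n\nFirst paragraph.\n\nSecond paragraph.\n"

lines = mini_corpus.splitlines()  # one entry per line, blank lines included
non_empty = [line for line in lines if line.strip()]

print(f"{len(lines)} lines, {len(non_empty)} non-empty")
print("First line is empty:", lines[0] == "")
```

This also explains why the very first example printed above is an empty string: the file begins with the two newlines written before the first title.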

In [None]:
from datasets import load_dataset


corpus_file = "ww1_corpus.txt"
dataset = load_dataset("text", data_files=corpus_file, split="train")

# Filter out empty lines and save to a new file
cleaned_file = "ww1_corpus_cleaned.txt"
with open(cleaned_file, "w", encoding="utf-8") as f:
    for line in dataset["text"]:
        if line.strip():  # Only write non-empty lines
            f.write(line + "\n")

# Reload the cleaned dataset
dataset = load_dataset("text", data_files=cleaned_file, split="train")
print(f"Cleaned dataset has {len(dataset)} examples")
print("First example:", dataset[0])

Generating train split: 0 examples [00:00, ? examples/s]

Cleaned dataset has 1441 examples
First example: {'text': '=== World War I ==='}


In [27]:
dataset

Dataset({
    features: ['text'],
    num_rows: 1441
})

In [28]:
for example in dataset['text'][:10]:
    print(example, end="\n\n")

=== World War I ===

World War I[b] or the First World War (28 July 1914 – 11 November 1918), also known as the Great War, was a global conflict between two coalitions: the Allies (or Entente) and the Central Powers. Fighting took place mainly in Europe and the Middle East, as well as in parts of Africa and the Asia-Pacific, and in Europe was characterised by trench warfare; the widespread use of artillery, machine guns, and chemical weapons (gas); and the introductions of tanks and aircraft. World War I was one of the deadliest conflicts in history, resulting in an estimated 10 million military dead and more than 20 million wounded, plus some 10 million civilian dead from causes including genocide. The movement of large numbers of people was a major factor in the deadly Spanish flu pandemic.

The causes of World War I included the rise of Germany and decline of the Ottoman Empire, which disturbed the long-standing balance of power in Europe, as well as economic competition between nat

Looks like WW1 history alright! Now, let's see an example of how GPT-2's tokenizer converts raw text into token IDs, which are numerical representations of the input text.

In [31]:
# A glance at how the GPT-2 tokenizer does its job.

sample = dataset['text'][1]
sample_tokenized = gpt2_tokenizer.encode(sample, return_tensors="pt")
sample_tokenized

tensor([[10603,  1810,   314,    58,    65,    60,   393,   262,  3274,  2159,
          1810,   357,  2078,  2901, 26833,   784,  1367,  3389, 25859,   828,
           635,  1900,   355,   262,  3878,  1810,    11,   373,   257,  3298,
          5358,  1022,   734,  5655,  1756,    25,   262, 32430,   357,   273,
          7232, 21872,     8,   290,   262,  5694, 20668,    13, 19098,  1718,
          1295,  8384,   287,  2031,   290,   262,  6046,  3687,    11,   355,
           880,   355,   287,  3354,   286,  5478,   290,   262,  7229,    12,
         22933,    11,   290,   287,  2031,   373,  2095,  1417,   416, 35091,
         15611,    26,   262, 10095,   779,   286, 20381,    11,  4572,  6541,
            11,   290,  5931,  3777,   357, 22649,  1776,   290,   262,  3120,
          2733,   286, 11657,   290,  6215,    13,  2159,  1810,   314,   373,
           530,   286,   262, 39268, 12333,   287,  2106,    11,  7186,   287,
           281,  6108,   838,  1510,  2422,  2636,  

In [34]:
for token_id in sample_tokenized[0]:
    print(f'Token ID: {token_id}\nToken: {gpt2_tokenizer.decode(token_id)}')

Token ID: 10603
Token: World
Token ID: 1810
Token:  War
Token ID: 314
Token:  I
Token ID: 58
Token: [
Token ID: 65
Token: b
Token ID: 60
Token: ]
Token ID: 393
Token:  or
Token ID: 262
Token:  the
Token ID: 3274
Token:  First
Token ID: 2159
Token:  World
Token ID: 1810
Token:  War
Token ID: 357
Token:  (
Token ID: 2078
Token: 28
Token ID: 2901
Token:  July
Token ID: 26833
Token:  1914
Token ID: 784
Token:  –
Token ID: 1367
Token:  11
Token ID: 3389
Token:  November
Token ID: 25859
Token:  1918
Token ID: 828
Token: ),
Token ID: 635
Token:  also
Token ID: 1900
Token:  known
Token ID: 355
Token:  as
Token ID: 262
Token:  the
Token ID: 3878
Token:  Great
Token ID: 1810
Token:  War
Token ID: 11
Token: ,
Token ID: 373
Token:  was
Token ID: 257
Token:  a
Token ID: 3298
Token:  global
Token ID: 5358
Token:  conflict
Token ID: 1022
Token:  between
Token ID: 734
Token:  two
Token ID: 5655
Token:  coal
Token ID: 1756
Token: itions
Token ID: 25
Token: :
Token ID: 262
Token:  the
Token ID: 32430
To

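Each token maps to a fixed integer ID, namely its index in GPT-2's byte-pair-encoding (BPE) vocabulary of 50,257 entries. The IDs are not random, but they carry no meaning on their own either: the model learns relationships between tokens through the embedding vector it looks up for each ID. Notice also that a leading space is part of the token ("World" and " World" have different IDs) and that rarer words are split into subword pieces (" coal" + "itions").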
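
The lookup idea can be shown with a toy dictionary. This is only a sketch: it uses a handful of ID/token pairs copied from the output above, whereas the real tokenizer also performs the BPE merging that decides how the text is split in the first place.

```python
# A handful of entries copied from the tokenizer output above.
vocab = {
    "World": 10603,
    " World": 2159,
    " War": 1810,
    " I": 314,
    " coal": 5655,
    "itions": 1756,
}
id_to_token = {token_id: token for token, token_id in vocab.items()}


def decode(token_ids):
    """Concatenate the token strings; spaces are already part of the tokens."""
    return "".join(id_to_token[i] for i in token_ids)


print(repr(decode([10603, 1810, 314])))   # 'World War I'
print(repr(decode([5655, 1756])))         # ' coalitions'
print(vocab["World"] == vocab[" World"])  # False
```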

In [47]:
# Define the tokenizer function.

def tokenizer_fxn(examples):
    return gpt2_tokenizer(examples['text'], truncation=True, max_length=512)

In [48]:
# Map the dataset, i.e. encode the training samples

dataset_encoded = dataset.map(tokenizer_fxn, batched=True, remove_columns=['text'])

Map:   0%|          | 0/1441 [00:00<?, ? examples/s]

Now our dataset has been encoded. Next, we generate a train-test split, keeping ~10% of the dataset for testing. Then we define the hyperparameters and begin fine-tuning.

In [49]:
# Generate the train-test split.

train_test_split = dataset_encoded.train_test_split(test_size=0.1)
train_dataset = train_test_split["train"]
test_dataset = train_test_split["test"]
print(f'Train size: {len(train_dataset)}, test size: {len(test_dataset)}')

Train size: 1296, test size: 145


In [50]:
# Define the data collator.

from transformers import DataCollatorForLanguageModeling

data_collator = DataCollatorForLanguageModeling(tokenizer=gpt2_tokenizer, mlm=False)

In [None]:
# Define the training arguments, and use Hugging Face's trainer API for fine-tuning.

from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    output_dir='./gpt-ww1-finetuned',
    overwrite_output_dir=True,
    num_train_epochs=5,
    per_device_eval_batch_size=4,
    per_device_train_batch_size=4,
    eval_steps=200,
    save_steps=200,
    warmup_steps=50,
    eval_strategy="steps",
    logging_steps=50,
    learning_rate=5e-5,
    save_total_limit=2
)

# Define the trainer.

trainer = Trainer(
    model=gpt2_lmheadmodel,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    data_collator=data_collator
)
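
Before training, it helps to work out how many optimizer steps these arguments imply and what the learning rate will look like over time. The sketch below assumes a single device, no gradient accumulation, and a linear schedule with warmup (believed to be the Trainer's default, but treat that as an assumption); linear_schedule is a hypothetical helper, not a library function.

```python
import math

train_samples = 1296   # from the split above
batch_size = 4         # per_device_train_batch_size
epochs = 5             # num_train_epochs
peak_lr = 5e-5         # learning_rate
warmup_steps = 50

# Assumes a single device and gradient_accumulation_steps=1.
steps_per_epoch = math.ceil(train_samples / batch_size)
total_steps = steps_per_epoch * epochs


def linear_schedule(step):
    """Linear warmup to peak_lr, then linear decay to zero."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr * max(0, total_steps - step) / (total_steps - warmup_steps)


print(f"{steps_per_epoch} steps per epoch, {total_steps} steps in total")
for step in (0, 25, 50, 800, total_steps):
    print(f"step {step:>4}: lr = {linear_schedule(step):.2e}")
```

Under these assumptions there are 324 steps per epoch and 1,620 in total, so with eval_steps=200 the last evaluation happens at step 1,600.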

In [None]:
# Start by calling .train(), and save the model.
trainer.train()
trainer.save_model('./gpt2-ww1-finetuned')
gpt2_tokenizer.save_pretrained('./gpt2-ww1-finetuned')

Step   Training Loss   Validation Loss
200    3.498           3.363681
400    3.2048          3.326559
600    3.1388          3.302164
800    2.9922          3.306558
1000   2.8945          3.306001
1200   2.8003          3.306996
1400   2.7374          3.317396
1600   2.7289          3.318511


('./gpt2-ww1-finetuned\\tokenizer_config.json',
 './gpt2-ww1-finetuned\\special_tokens_map.json',
 './gpt2-ww1-finetuned\\vocab.json',
 './gpt2-ww1-finetuned\\merges.txt',
 './gpt2-ww1-finetuned\\added_tokens.json')

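The training loss decreases consistently, from 3.50 to 2.73. The validation loss, however, bottoms out at about 3.30 around step 600 and then creeps back up to 3.32, which suggests that the model starts to overfit our small dataset well before training ends. Still, our model has certainly learned something! With our training done, let's proceed to generate some text.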
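
One way to read these numbers is to convert the loss into perplexity, which is the exponential of the loss, assuming the reported value is the mean per-token cross-entropy in nats. The sketch below does that for the log above and finds the step with the lowest validation loss. The values are copied from the training output; nothing here is re-measured.

```python
import math

# (step, training loss, validation loss) copied from the log above.
log = [
    (200, 3.498, 3.363681),
    (400, 3.2048, 3.326559),
    (600, 3.1388, 3.302164),
    (800, 2.9922, 3.306558),
    (1000, 2.8945, 3.306001),
    (1200, 2.8003, 3.306996),
    (1400, 2.7374, 3.317396),
    (1600, 2.7289, 3.318511),
]

best_step, _, best_val = min(log, key=lambda row: row[2])
print(f"Lowest validation loss {best_val:.4f} at step {best_step}")

for step, train_loss, val_loss in log:
    gap = val_loss - train_loss  # a widening gap hints at overfitting
    print(f"step {step:>4}: val perplexity {math.exp(val_loss):6.2f}, gap {gap:+.3f}")
```

A checkpoint near the best validation step would likely generalize slightly better than the final one, although with a difference this small the effect on generated text may be hard to notice.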

In [63]:
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Load the fine-tuned model and tokenizer
fine_tuned_model = GPT2LMHeadModel.from_pretrained("./gpt2-ww1-finetuned")
fine_tuned_tokenizer = GPT2Tokenizer.from_pretrained("./gpt2-ww1-finetuned")
fine_tuned_model.eval()

# Define a prompt
prompt = "World War I was"
input_ids = fine_tuned_tokenizer.encode(prompt, return_tensors="pt")

# Generate text
output = fine_tuned_model.generate(
    input_ids,
    max_length=200,  # Increased length for better context
    do_sample=True,  # Sampling for diverse output
    top_k=40,        # Reduce randomness slightly (more controlled than 50)
    top_p=0.92,      # Slightly lower nucleus sampling to reduce extreme randomness
    temperature=0.65, # Lowered to balance creativity and coherence
    repetition_penalty=1.2,  # Reduces looping and repeated phrases
    no_repeat_ngram_size=3,  # Avoids repeating 3-grams
    early_stopping=True,  # Only affects beam search; has no effect here since num_beams=1
    pad_token_id=fine_tuned_tokenizer.eos_token_id
)


# Decode and print the generated text
generated_text = fine_tuned_tokenizer.decode(output[0], skip_special_tokens=True)
print("Generated text:")
print(generated_text)



Generated text:
World War I was a significant event in European history, and the war saw two major revolutions: one by Germany's Allies against France (1914–18), the other as an attempt to end World Wars II. The first revolution involved Russia seizing power from Austria-Hungary; later that year Britain declared victory over French rule.[43] In 1914 after Franco-British support for British expansionism faltered, the Bolsheviks overthrew Tsar Nicholas V in 1917 with Russian backing. However this did not last long; many revolutionaries died before they could be replaced by their own party or political movement.[44][45]:[46]) This led both sides towards civil wars which continued into 1918,[47]. Although there were some successes such at Verdun, these resulted mainly of violent clashes between combatants – usually unarmed protestors who attempted nonviolent protests but became increasingly isolated due either lacklustre leadership skills or poor organisation.[48] During the Second Battle 

Right. So, the text is, again, factually wrong (there was no "Tsar Nicholas V", for one). But there's a notable shift: the writing style now closely resembles Wikipedia's, down to the bracketed citation markers. It's clear that the training data was not enough for GPT-2 to pick up the actual history. Let's use a book to fine-tune it further.
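
As an aside on the generation settings used above, the idea behind no_repeat_ngram_size can be sketched in a few lines: before each step, find every token that would complete an n-gram that already occurs in the sequence, and forbid it. The function below is a hypothetical illustration of that idea, not the library's code, and the sequence is made up from token IDs seen earlier.

```python
def banned_next_tokens(generated, n):
    """Return the token IDs that would complete an n-gram already in `generated`."""
    if len(generated) < n:
        return set()
    prefix = tuple(generated[len(generated) - (n - 1):])  # last n-1 tokens
    banned = set()
    for start in range(len(generated) - n + 1):
        if tuple(generated[start:start + n - 1]) == prefix:
            banned.add(generated[start + n - 1])
    return banned


# " World War I was World War", built from token IDs shown earlier.
generated = [2159, 1810, 314, 373, 2159, 1810]
print(banned_next_tokens(generated, n=3))  # {314}
```

Here the sequence ends in " World War", which was previously followed by " I" (ID 314), so with n=3 that token is excluded from the next step and the model has to continue differently.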

In [65]:
from pdf_utility import pdf_to_txt

pdf_path = r"C:\Users\HP\OneDrive\Desktop\Books\History\The Great War_ A Combat History of the First World War (1).pdf"
txt_path = r"the_great_war_book.txt"

pdf_to_txt(pdf_path=pdf_path, txt_path=txt_path)

We will use pdf_to_txt(), which we defined in pdf_utility.py, to extract text from The Great War by Peter Hart. It's available on the Internet Archive, and is a fairly in-depth exploration of WW1. (I hope I don't face copyright issues, haha.)

In [66]:
pdf_corpus = "the_great_war_book.txt"

In [68]:
# A glance at the corpus from PDF.

# Load the new corpus
pdf_corpus = "the_great_war_book.txt"
with open(pdf_corpus, "r", encoding="utf-8") as f:
    new_corpus_text = f.read()

# Basic inspection
print(f"Total characters: {len(new_corpus_text)}")
print(f"First 500 characters:\n{new_corpus_text[:500]}")
print(f"Number of lines: {len(new_corpus_text.splitlines())}")

Total characters: 1281708
First 500 characters:

		£	©	½	¾	É	Ü	à	â	ä	ç	è	é	ê	ó	ô	ö	ü	Ł	
ś
	&	–	—	‘	’	“	”	…
THE	GREAT	WAR
ALSO	BY	PETER	HART
Gallipoli
1918:	A	Very	British	Victory
Aces	Falling:	War	Above	the	Trenches,	1918
The	Somme
Bloody	April:	Slaughter	in	the	Skies	Over	Arras,	1917
Somme	Success:	The	RFC	and	the	Battle	of	the	Somme
WITH	NIGEL	STEEL
Tumult	in	the	Clouds
Defeat	at	Gallipoli
Passchendaele
Jutland	1916
THE	GREAT	WAR
A	Combat	History	of	the
First	World	War
PETER	HART

Oxford	University	Press
Oxford	University	Press	is	a	departm
Number of lines: 28680


These are definitely the opening few pages. Alright, same drill. Load, preprocess, train, and evaluate. Let's get it done.

In [70]:
from datasets import load_dataset

# Load the new corpus as a dataset
dataset = load_dataset("text", data_files=pdf_corpus, split="train")

# Filter out empty lines
dataset = dataset.filter(lambda example: example["text"].strip() != "")
print(f"Dataset size: {len(dataset)} examples")

Dataset size: 28633 examples


We have a lot more training samples this time. We have already removed the empty rows. Let's proceed with tokenization.
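
One thing the inspection output above suggests is that the extracted text separates words with tabs rather than spaces. A possible cleanup, which this notebook does not apply, is to collapse tabs and repeated spaces before tokenizing. The helper below is a hypothetical sketch using only the standard library.

```python
import re


def normalize_whitespace(line):
    """Collapse runs of tabs/spaces into single spaces and trim the ends."""
    return re.sub(r"[ \t]+", " ", line).strip()


# Lines copied from the first 500 characters printed above.
samples = ["THE\tGREAT\tWAR", "A\tCombat\tHistory\tof\tthe", "Oxford\tUniversity\tPress"]
for raw in samples:
    print(repr(normalize_whitespace(raw)))
```

With the datasets library, a function like this could be applied through dataset.map before tokenization, so that the model sees ordinary space-separated text like the data it was pretrained on.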

In [None]:
from transformers import GPT2Tokenizer

# Load the fine-tuned tokenizer
tokenizer = GPT2Tokenizer.from_pretrained("./gpt2-ww1-finetuned")
tokenizer.pad_token = tokenizer.eos_token  # Ensure padding is set

# Tokenize function
def tokenize_function(examples):
    return tokenizer(examples["text"], truncation=True, max_length=512)

# Tokenize
train_dataset = dataset.map(tokenize_function, batched=True, remove_columns=["text"])

In [73]:
# Set up the data collator.
from transformers import DataCollatorForLanguageModeling

data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

In [76]:
# Load the previously fine-tuned model.
from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained("./gpt2-ww1-finetuned")

In [89]:
# Define the training arguments
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    output_dir="./gpt2-ww1-finetuned-v2",  # New directory for this version
    overwrite_output_dir=True,
    num_train_epochs=3,  # Fewer epochs for additional fine-tuning
    per_device_train_batch_size=4,
    save_steps=200,
    warmup_steps=50,
    logging_steps=200,
    learning_rate=2e-5,  # Lower LR for fine-tuning an already trained model
    save_total_limit=2,
)

In this case, we only train for 3 epochs, since our fine-tuned model (v1) already knows the basics but lacks depth. We use a lower learning rate so that the model preserves what it learned previously.

In [90]:
# Define the trainer. 

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    data_collator=data_collator,
)


In [91]:
# Start training
trainer.train()
trainer.save_model("./gpt2-ww1-finetuned-v2")
tokenizer.save_pretrained("./gpt2-ww1-finetuned-v2")

Step   Training Loss
200    2.2625
400    2.2913
600    2.6462
800    2.6071
1000   2.6267
1200   2.5951
1400   2.5221
1600   2.5857
1800   2.51
2000   2.5126


('./gpt2-ww1-finetuned-v2\\tokenizer_config.json',
 './gpt2-ww1-finetuned-v2\\special_tokens_map.json',
 './gpt2-ww1-finetuned-v2\\vocab.json',
 './gpt2-ww1-finetuned-v2\\merges.txt',
 './gpt2-ww1-finetuned-v2\\added_tokens.json')

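The training loss was a lot messier this time: it rose from 2.26 at step 200 to 2.65 at step 600, then drifted back down to about 2.51 by step 2000, never dropping below its starting value in the steps shown. We also passed no eval_dataset, so there is no validation loss to check against. But let's hope that it learned something!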

In [95]:
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Load the fine-tuned v2 model and tokenizer
fine_tuned_model_v2 = GPT2LMHeadModel.from_pretrained("./gpt2-ww1-finetuned-v2")
fine_tuned_tokenizer_v2 = GPT2Tokenizer.from_pretrained("./gpt2-ww1-finetuned-v2")
fine_tuned_model_v2.eval()

# Define a prompt
prompt = "World War I was"
input_ids = fine_tuned_tokenizer_v2.encode(prompt, return_tensors="pt")

# Generate text
output = fine_tuned_model_v2.generate(
    input_ids,
    max_length=150,          # Adjust for length
    do_sample=True,          # Creative sampling
    top_k=50,                # Top-k sampling
    top_p=0.95,              # Top-p sampling
    temperature=0.7,         # Balanced randomness
    no_repeat_ngram_size=2,  # Prevent repetition
    pad_token_id=fine_tuned_tokenizer_v2.eos_token_id
)

# Decode and print
generated_text = fine_tuned_tokenizer_v2.decode(output[0], skip_special_tokens=True)
print("Generated text:")
print(generated_text)

Generated text:
World War I was	a	dreadful	experience	for	the	Austrians.	The	Russians	had	to	get	over	them	and	win	their	war	on	31	July.’	But	this	was	not	just	an	enormous	wonderful,	it	is	nigh-inviolable.,‘The”	Germans	were	also	failing	–	in	what	seemed	like	everywhere,–“‛	until	they	happened	at	12.40	pm	when	Joffre	showed	up


Right, so now the model has adapted to The Great War by Peter Hart. This generated sequence is slightly less nonsensical than before, although it separates words with tabs rather than spaces, mirroring the tab-delimited text extracted from the PDF. The model's struggle to learn history could be due to underfitting: we used the base model, which has only 124 million parameters, compared to the XL version's 1.5 billion. Here is what we learned from this project.
Key Learnings

1. Model Limitations

We used GPT-2 Base (124M parameters), which showed limitations in capturing long-term coherence and historical accuracy.

Generated text exhibited nonsensical phrasing and factual inaccuracies, likely due to the model’s limited parameter size.

The model struggled with dates, names, and logical event sequences, leading to unrealistic outputs.

2. Training Process Observations

Training loss steadily decreased, but validation loss was initially NaN due to improper dataset formatting.

The model started to produce less chaotic but still flawed generations after multiple iterations.

Larger context windows improved text structure, but did not significantly enhance factual accuracy.

3. Data Preprocessing Challenges

The dataset contained empty or malformed samples, which affected training stability.

Special characters, inconsistent formatting, and line breaks caused tokenization issues.

Some historical terms and dates were not properly learned due to inconsistent representation.

Areas for Improvement

1. Upgrade Model Size

Fine-tuning GPT-2 Medium (355M) or Large (774M) can improve coherence and factual consistency.

Consider switching to Llama 2 or Falcon, which are newer and considerably more capable than GPT-2, though they need more memory and compute.

2. Better Training Strategy

Use more epochs, but carefully monitor for overfitting.

Implement LoRA/QLoRA for parameter-efficient fine-tuning.

Experiment with learning rate scheduling for better convergence.

3. Enhanced Preprocessing

Clean dataset by removing empty samples and malformed text.

Standardize date and name formats to help with consistency.

Improve tokenization handling for special characters.

4. Alternative Sampling Methods

Experiment with different temperature, top-k, and top-p values for more controlled text generation.

Implement beam search instead of sampling for more structured output.
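
To illustrate the last point, here is a toy comparison of greedy decoding and beam search. It is a minimal sketch over an invented next-token table, not a call into the Transformers library (where beam search would be requested through the num_beams argument of generate()). The table, probabilities, and function name are all hypothetical.

```python
import math

# Invented next-token probabilities for a tiny vocabulary.
NEXT = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"war": 0.5, "front": 0.5},
    "a": {"war": 0.9, "front": 0.1},
    "war": {"</s>": 1.0},
    "front": {"</s>": 1.0},
}


def beam_search(start, num_beams, max_steps):
    """Keep the num_beams highest-scoring partial sequences at every step."""
    beams = [([start], 0.0)]  # (tokens, summed log-probability)
    for _ in range(max_steps):
        candidates = []
        for tokens, score in beams:
            last = tokens[-1]
            if last == "</s>":  # finished sequences are carried over unchanged
                candidates.append((tokens, score))
                continue
            for token, prob in NEXT[last].items():
                candidates.append((tokens + [token], score + math.log(prob)))
        candidates.sort(key=lambda item: item[1], reverse=True)
        beams = candidates[:num_beams]
    return beams


greedy_tokens, greedy_score = beam_search("<s>", num_beams=1, max_steps=3)[0]
beam_tokens, beam_score = beam_search("<s>", num_beams=2, max_steps=3)[0]

print("greedy:", greedy_tokens, round(math.exp(greedy_score), 2))
print("beam  :", beam_tokens, round(math.exp(beam_score), 2))
```

Greedy decoding commits to "the" because it is the most likely first token, while the two-beam search keeps "a" alive and ends up with a more probable sequence overall. The trade-off is that beam search tends to produce safer, more repetitive text than sampling.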