# Final Assignment 3
# Character-level LSTM with Keras: Iliad

Name: Sietse Neve

Student number: 1810364

**Goal:** train a char-level LSTM on `iliad.txt` (lowercased, otherwise raw), with:
- a sliding window of length **m = 100** characters → predict the **next** character
- model: LSTM(256, return_sequences=True) → Dropout(0.2) → LSTM(256) → Dropout(0.1) → Dense(|V|, softmax)
- optimizer **Adam**, loss **categorical_crossentropy**
- **ModelCheckpoint** on `loss` (save_best_only)
- after training: generate 1000 characters from a random start sequence

**Why char-level (instead of word-level)?**
- No tokenization or cleaning needed; the model learns spelling and patterns directly from raw characters.
- Downside: longer sequences are needed to capture sentence structure, and learning is often slower.

**What I will log/choose:**
- X shape: `(n_sequences, 100, 1)` (1 feature = one integer per timestep)
- X normalization: divide by vocab_size (so the input lies in [0, 1))
- y: one-hot with `to_categorical`
- 20 epochs, batch_size 128 (or less if resources are tight)
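
To make the sliding-window idea concrete, here is a minimal sketch on a toy string with a window of 4 characters instead of 100. The toy text and the `make_windows` helper are illustrative assumptions and are not part of the notebook itself.

```python
def make_windows(text, seq_len):
    """Hypothetical helper: pair every window of seq_len chars with the char that follows it."""
    return [(text[i:i + seq_len], text[i + seq_len])
            for i in range(len(text) - seq_len)]

toy_text = "achilles"                      # toy input, stands in for the Iliad text
pairs = make_windows(toy_text, seq_len=4)

for window, target in pairs:
    print(repr(window), "->", repr(target))
```

A text of length N yields N - m training pairs, which is why the 1,116,792-character text with m = 100 gives 1,116,692 sequences.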


In [11]:
# STEP 1: LOAD ILIAD, RAW CHAR DATA, LOWERCASE
# ---------------------------------------------
# Assignment: "Load the data from iliad.txt ... and convert it to lower case.
# No further cleaning; we use raw character data."

import os
import numpy as np
import tensorflow as tf

from tensorflow.keras.models import Sequential, load_model
from tensorflow.keras.layers import LSTM, Dense, Dropout
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.callbacks import ModelCheckpoint

# Reproducibility (as far as possible with GPU/cuDNN)
SEED = 42
np.random.seed(SEED)
tf.random.set_seed(SEED)

# Path to the data file
ILIAD_PATH = "iliad.txt"

# Sliding-window length m (100 per the assignment)
SEQ_LEN = 100

# Number of characters to generate
GEN_STEPS = 1000

# Checkpoint path (best loss)
os.makedirs("checkpoints", exist_ok=True)
BEST_CKPT_PATH = "checkpoints/lstm_char_best.h5"

# 1) Load and lowercase
with open(ILIAD_PATH, "r", encoding="utf-8", errors="ignore") as f:
    raw_text = f.read().lower()

print(f"Text length (chars): {len(raw_text):,}")
print("Sample (first 200 characters):")
print(raw_text[:200])

print("TF version:", tf.__version__)
print("GPU available:", tf.config.list_physical_devices("GPU"))



Text length (chars): 1,116,792
Sample (first 200 characters):
the project gutenberg ebook of the iliad
    
this ebook is for the use of anyone anywhere in the united states and
most other parts of the world at no cost and with almost no restrictions
whatsoever.
TF version: 2.19.0
GPU available: [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]


In [12]:
# STEP 2: UNIQUE CHARACTERS, MAPPINGS, SLIDING WINDOW
# ----------------------------------------------------
# - Build the list of unique chars and the dicts char_to_int, int_to_char
# - Sliding window (length 100): X = sequences of ints, Y = the 'next' char (as int)

# Unique characters
chars = sorted(set(raw_text))
vocab_size = len(chars)
print("Vocab size:", vocab_size)
print("Chars (first 100):", chars[:100])

# Mappings
char_to_int = {c: i for i, c in enumerate(chars)}
int_to_char = {i: c for c, i in char_to_int.items()}

# Sliding window
X_int = []  # each entry is a list of 100 ints
Y_int = []  # target: int of the character that follows the window

for i in range(0, len(raw_text) - SEQ_LEN):
    seq_in = raw_text[i : i + SEQ_LEN]
    seq_out = raw_text[i + SEQ_LEN]  # the character immediately after the window
    X_int.append([char_to_int[ch] for ch in seq_in])
    Y_int.append(char_to_int[seq_out])

X_int = np.array(X_int, dtype=np.int32)
Y_int = np.array(Y_int, dtype=np.int32)

print(f"Number of sequences (n): {X_int.shape[0]:,} | Sequence length (m): {SEQ_LEN}")
print("Example X_int[0][:20]:", X_int[0][:20], "-> Y_int[0]:", Y_int[0], f"('{int_to_char[Y_int[0]]}')")


Vocab size: 146
Chars (first 100): ['\n', ' ', '!', '#', '$', '%', '&', "'", '(', ')', '*', ',', '-', '.', '/', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', ':', ';', '?', '[', ']', '_', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', '§', 'à', 'ä', 'æ', 'è', 'é', 'ê', 'ë', 'ï', 'ò', 'ô', 'ö', 'ù', 'ü', 'œ', 'ά', 'έ', 'ή', 'ί', 'α', 'β', 'γ', 'δ', 'ε', 'η', 'θ', 'ι', 'κ', 'λ', 'μ', 'ν', 'ξ', 'ο', 'π', 'ρ', 'ς', 'σ', 'τ', 'υ', 'φ', 'χ', 'ω', 'ό']
Number of sequences (n): 1,116,692 | Sequence length (m): 100
Example X_int[0][:20]: [50 38 35  1 46 48 45 40 35 33 50  1 37 51 50 35 44 32 35 48] -> Y_int[0]: 39 ('i')
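
The windowing loop builds more than a million small Python lists before converting them to an array. One possible alternative is NumPy's `sliding_window_view`, which produces the same windows without an explicit loop. The sketch below shows the idea on a toy array; `build_windows` is a hypothetical helper and was not used for the results in this notebook.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def build_windows(encoded, seq_len):
    """Hypothetical helper: return (windows, targets) shaped like X_int and Y_int."""
    encoded = np.asarray(encoded, dtype=np.int32)
    # All windows of length seq_len; the last one has no following character, so drop it.
    windows = sliding_window_view(encoded, seq_len)[:-1]
    # The target of window i is the element at position i + seq_len.
    targets = encoded[seq_len:]
    return windows, targets

toy = np.arange(6)                          # stands in for the integer-encoded text
windows, targets = build_windows(toy, seq_len=3)
print(windows.tolist(), targets.tolist())
```

The result is a read-only view on the encoded text, so the later `astype(np.float32)` call is the point where a full copy would be made.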


In [13]:
# STEP 3: RESHAPE TO 3D TENSOR (n, m, 1)
# --------------------------------------
# Assignment: "Reshape the data in X to (n, m, 1) ... 1 is the number of features, integer data."
# Note: normalization happens in step 4.

n_sequences = X_int.shape[0]        # n
m = SEQ_LEN                         # m (100)
X = X_int.reshape((n_sequences, m, 1))  # (n, m, 1)

print("X dtype:", X.dtype)
print("X shape:", X.shape)


X dtype: int32
X shape: (1116692, 100, 1)


In [14]:
# STEP 4: NORMALIZE
# -----------------
# "Normalize X by dividing by the number of characters in the dictionary."
# This scales the integer indices into [0, 1); the maximum is (vocab_size - 1) / vocab_size,
# which keeps the LSTM inputs in a small, fixed range.

X = X.astype(np.float32) / float(vocab_size)
print("X after normalization, dtype:", X.dtype, "min:", X.min(), "max:", X.max())


X after normalization, dtype: float32 min: 0.0 max: 0.9931507
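
The printed maximum follows directly from the vocabulary size: the largest index is 145, so the largest scaled value is 145/146. A small sketch of that arithmetic, with `vocab_size` hard-coded from the vocabulary output:

```python
vocab_size = 146                                  # taken from the "Vocab size" output
scaled = [i / vocab_size for i in range(vocab_size)]

min_value = min(scaled)                           # index 0 maps to 0.0
max_value = max(scaled)                           # index 145 maps to 145/146
print(min_value, max_value)
```

Because the largest index is vocab_size - 1, the value 1.0 is never reached.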


In [15]:
# STEP 5: ONE-HOT ENCODING OF Y
# -----------------------------
# "Convert y to a one-hot encoded representation."
# Output dimension = vocab_size (|V|), so to_categorical(..., num_classes=vocab_size).

y = to_categorical(Y_int, num_classes=vocab_size).astype(np.float32)
print("y shape:", y.shape, "(expected: (n, vocab_size))")


y shape: (1116692, 146) (expected: (n, vocab_size))
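
The one-hot target matrix is the largest array in this notebook. The sketch below estimates the memory footprint of `X` and `y` from their printed shapes, assuming 4 bytes per float32 value.

```python
n_sequences = 1_116_692     # from the printed X shape
seq_len = 100
vocab_size = 146
bytes_per_float32 = 4

x_bytes = n_sequences * seq_len * 1 * bytes_per_float32   # X has shape (n, m, 1)
y_bytes = n_sequences * vocab_size * bytes_per_float32    # y has shape (n, vocab_size)

print(f"X: {x_bytes / 1e6:.0f} MB, y: {y_bytes / 1e6:.0f} MB")
```

That is roughly 450 MB for `X` and 650 MB for `y`. If memory became a problem, Keras also offers `sparse_categorical_crossentropy`, which accepts the integer targets directly; I did not use it here because the assignment asks for one-hot targets.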


In [16]:
# STEP 6: MODEL DEFINITION (SEQUENTIAL)
# -------------------------------------
# Layers (exactly per the assignment):
# 1) LSTM(256, input_shape=(m,1), return_sequences=True)
# 2) Dropout(0.2)
# 3) LSTM(256)
# 4) Dropout(0.1)
# 5) Dense(vocab_size, activation='softmax')

model = Sequential([
    LSTM(256, input_shape=(m, 1), return_sequences=True),
    Dropout(0.2),
    LSTM(256),
    Dropout(0.1),
    Dense(vocab_size, activation='softmax')
])

model.summary()


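
The table printed by `model.summary()` is not part of this export, but the parameter counts can be derived by hand. The sketch below uses the standard LSTM parameter formula (four gates, each with input weights, recurrent weights and a bias); the helper names are hypothetical.

```python
def lstm_params(input_dim, units):
    """Parameters of one LSTM layer: 4 gates x (input weights + recurrent weights + bias)."""
    return 4 * (input_dim * units + units * units + units)

def dense_params(input_dim, units):
    """Parameters of one Dense layer: weights + bias."""
    return input_dim * units + units

vocab_size = 146
lstm1 = lstm_params(input_dim=1, units=256)      # first LSTM sees 1 feature per timestep
lstm2 = lstm_params(input_dim=256, units=256)    # second LSTM sees the 256 outputs of the first
dense = dense_params(input_dim=256, units=vocab_size)
total = lstm1 + lstm2 + dense                    # Dropout layers add no parameters

print(lstm1, lstm2, dense, total)
```

Under these assumptions the model has 264,192 + 525,312 + 37,522 = 827,026 trainable parameters.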


In [17]:
# STEP 7: COMPILE
# ---------------
# "Compile with the adam optimizer and categorical_crossentropy as the loss function."

model.compile(optimizer="adam", loss="categorical_crossentropy")
print("Model compiled (optimizer=adam, loss=categorical_crossentropy).")


Model compiled (optimizer=adam, loss=categorical_crossentropy).
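
With a one-hot target, categorical cross-entropy reduces to the negative natural log of the probability assigned to the correct character. The sketch below computes it by hand for two made-up predictions, which gives a reference point for the training loss: a model that guesses uniformly over all 146 characters scores ln(146). The function is a plain-Python illustration, not the Keras implementation.

```python
import math

def categorical_crossentropy(one_hot, predicted):
    """Cross-entropy for one sample: -sum(t * ln(p)) over all classes."""
    return -sum(t * math.log(p) for t, p in zip(one_hot, predicted) if t > 0)

vocab_size = 146
target_index = 39                                  # 'i', the target of the first sequence
one_hot = [0.0] * vocab_size
one_hot[target_index] = 1.0

# Made-up prediction 1: uniform over the whole vocabulary.
uniform = [1.0 / vocab_size] * vocab_size
baseline = categorical_crossentropy(one_hot, uniform)

# Made-up prediction 2: 90% on the correct character, the rest spread evenly.
confident = [0.1 / (vocab_size - 1)] * vocab_size
confident[target_index] = 0.9
confident_loss = categorical_crossentropy(one_hot, confident)

print(round(baseline, 4), round(confident_loss, 4))
```

The uniform baseline is about 4.98, so any training loss well below that means the model has learned something about the character distribution.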


In [18]:
# STEP 8: TRAINING + CHECKPOINT
# -----------------------------
# "Fit the model ... 20 epochs, batch_size 128. Use ModelCheckpoint(..., monitor='loss',
#  save_best_only=True, mode='min')."
#
# Note: this can take a long time on CPU. With a GPU (Colab) each epoch is faster.
# We monitor 'loss' (training loss), as explicitly requested.

checkpoint_cb = ModelCheckpoint(
    filepath=BEST_CKPT_PATH,
    monitor='loss',
    verbose=1,
    save_best_only=True,
    mode='min'
)

EPOCHS = 20
BATCH_SIZE = 128

history = model.fit(
    X, y,
    epochs=EPOCHS,
    batch_size=BATCH_SIZE,
    callbacks=[checkpoint_cb]
)


Epoch 1/20
Epoch 1: loss improved from inf to 2.67094, saving model to checkpoints/lstm_char_best.h5
8725/8725 ━━━━━━━━━━━━━━━━━━━━ 265s 30ms/step - loss: 2.8677
Epoch 2/20
Epoch 2: loss improved from 2.67094 to 2.30957, saving model to checkpoints/lstm_char_best.h5
8725/8725 ━━━━━━━━━━━━━━━━━━━━ 322s 30ms/step - loss: 2.3728
Epoch 3/20
Epoch 3: loss improved from 2.30957 to 2.12581, saving model to checkpoints/lstm_char_best.h5
8725/8725 ━━━━━━━━━━━━━━━━━━━━ 264s 30ms/step - loss: 2.1620
Epoch 4/20
Epoch 4: loss improved from 2.12581 to 2.01639, saving model to checkpoints/lstm_char_best.h5
8725/8725 ━━━━━━━━━━━━━━━━━━━━ 265s 30ms/step - loss: 2.0420
Epoch 5/20
Epoch 5: loss improved from 2.01639 to 1.93749, saving model to checkpoints/lstm_char_best.h5
8725/8725 ━━━━━━━━━━━━━━━━━━━━ 323s 30ms/step - loss: 1.9556
Epoch 6/20
Epoch 6: loss improved from 1.93749 to 1.88127, saving model to checkpoints/lstm_char_best.h5
8725/8725 ━━━━━━━━━━━━━━━━━━━━ 265s 30ms/step - loss: 1.8953
Epoch 7/20
Epoch 7: loss improved from 1.88127 to 1.83784, saving model to checkpoints/lstm_char_best.h5
8725/8725 ━━━━━━━━━━━━━━━━━━━━ 322s 30ms/step - loss: 1.8483
Epoch 8/20
Epoch 8: loss improved from 1.83784 to 1.80390, saving model to checkpoints/lstm_char_best.h5
8725/8725 ━━━━━━━━━━━━━━━━━━━━ 322s 30ms/step - loss: 1.8130
Epoch 9/20
Epoch 9: loss improved from 1.80390 to 1.77620, saving model to checkpoints/lstm_char_best.h5
8725/8725 ━━━━━━━━━━━━━━━━━━━━ 322s 30ms/step - loss: 1.7848
Epoch 10/20
Epoch 10: loss improved from 1.77620 to 1.75190, saving model to checkpoints/lstm_char_best.h5
8725/8725 ━━━━━━━━━━━━━━━━━━━━ 322s 30ms/step - loss: 1.7603
Epoch 11/20
Epoch 11: loss improved from 1.75190 to 1.73164, saving model to checkpoints/lstm_char_best.h5
8725/8725 ━━━━━━━━━━━━━━━━━━━━ 264s 30ms/step - loss: 1.7386
Epoch 12/20
Epoch 12: loss improved from 1.73164 to 1.71470, saving model to checkpoints/lstm_char_best.h5
8725/8725 ━━━━━━━━━━━━━━━━━━━━ 322s 30ms/step - loss: 1.7196
Epoch 13/20
Epoch 13: loss improved from 1.71470 to 1.69885, saving model to checkpoints/lstm_char_best.h5
8725/8725 ━━━━━━━━━━━━━━━━━━━━ 322s 30ms/step - loss: 1.7041
Epoch 14/20
Epoch 14: loss improved from 1.69885 to 1.68333, saving model to checkpoints/lstm_char_best.h5
8725/8725 ━━━━━━━━━━━━━━━━━━━━ 322s 30ms/step - loss: 1.6876
Epoch 15/20
Epoch 15: loss improved from 1.68333 to 1.67229, saving model to checkpoints/lstm_char_best.h5
8725/8725 ━━━━━━━━━━━━━━━━━━━━ 322s 30ms/step - loss: 1.6762
Epoch 16/20
Epoch 16: loss improved from 1.67229 to 1.65947, saving model to checkpoints/lstm_char_best.h5
8725/8725 ━━━━━━━━━━━━━━━━━━━━ 265s 30ms/step - loss: 1.6633
Epoch 17/20
Epoch 17: loss improved from 1.65947 to 1.64793, saving model to checkpoints/lstm_char_best.h5
8725/8725 ━━━━━━━━━━━━━━━━━━━━ 322s 30ms/step - loss: 1.6518
Epoch 18/20
Epoch 18: loss improved from 1.64793 to 1.63844, saving model to checkpoints/lstm_char_best.h5
8725/8725 ━━━━━━━━━━━━━━━━━━━━ 265s 30ms/step - loss: 1.6410
Epoch 19/20
Epoch 19: loss improved from 1.63844 to 1.62886, saving model to checkpoints/lstm_char_best.h5
8725/8725 ━━━━━━━━━━━━━━━━━━━━ 265s 30ms/step - loss: 1.6323
Epoch 20/20
Epoch 20: loss improved from 1.62886 to 1.61961, saving model to checkpoints/lstm_char_best.h5
8725/8725 ━━━━━━━━━━━━━━━━━━━━ 322s 30ms/step - loss: 1.6219

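Two numbers in the training log can be checked with simple arithmetic: the 8725 steps per epoch, and what the best loss of 1.61961 means in terms of perplexity. The sketch below assumes the loss is a natural-log cross-entropy averaged over the epoch, which is the Keras default.

```python
import math

n_sequences = 1_116_692
batch_size = 128
best_loss = 1.61961                                       # checkpointed loss after epoch 20

steps_per_epoch = math.ceil(n_sequences / batch_size)     # the last batch is partial
perplexity = math.exp(best_loss)                          # effective number of equally likely choices
uniform_perplexity = 146                                  # perplexity of a uniform guess over the vocabulary

print(steps_per_epoch, round(perplexity, 2), uniform_perplexity)
```

A perplexity of about 5 means that on the training data the model is, on average, as uncertain as if it were choosing between roughly 5 equally likely characters instead of 146. This is a training-set figure only, since no validation split was used.
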

In [19]:
# STEP 9: LOAD BEST MODEL + RANDOM START SEQUENCE
# -----------------------------------------------
# "Load the model with the best loss and compile it. Pick a random start sequence from X and print it."

best_model = load_model(BEST_CKPT_PATH)
best_model.compile(optimizer="adam", loss="categorical_crossentropy")
print("Best model loaded from:", BEST_CKPT_PATH)

# Pick a random seed sequence from the training data (as ints, length 100)
start_idx = np.random.randint(0, n_sequences)
seed_seq = X_int[start_idx].tolist()  # note: X_int holds *ints* (pre-normalization)
seed_text = "".join(int_to_char[i] for i in seed_seq)

print("\n=== Start sequence (100 chars) ===")
print(seed_text)


Best model loaded from: checkpoints/lstm_char_best.h5

=== Start sequence (100 chars) ===
ge thy servant, and the greeks destroy.”

thus chryses pray’d:—the favouring power attends,
and from


In [20]:
# STEP 10: GENERATE 1000 CHARACTERS
# ---------------------------------
# For each step:
# - Reshape the current sequence to (1, m, 1) and normalize (/ vocab_size)
# - Predict the softmax distribution, take argmax → integer
# - Convert int → char and print without a newline
# - Slide the window: last 99 + new char

generated = []
pattern = list(seed_seq)  # copy of the start sequence (ints)

for step in range(GEN_STEPS):
    x_in = np.array(pattern, dtype=np.float32).reshape(1, m, 1) / float(vocab_size)
    proba = best_model.predict(x_in, verbose=0)[0]  # (vocab_size,)
    next_idx = int(np.argmax(proba))                # argmax per the assignment
    next_char = int_to_char[next_idx]
    print(next_char, end="")                        # no newline
    generated.append(next_char)
    pattern = pattern[1:] + [next_idx]              # slide the window one step to the right

print("\n\n=== Done generating ===")



 the shades of the shores of fate.
the shout a shout the shades of the shore,
and the shoued shades of the shores of fate.
the shout a force of the shores of fate.
the shout a force of the shores of fate.
the shout a force of the shores of fate.
the shout a force of the shores of fate.
the shout a force of the shores of fate.
the shout a force of the shores of fate.
the shout a force of the shores of fate.
the shout a force of the shores of fate.
the shout a force of the shores of fate.
the shout a force of the shores of fate.
the shout a force of the shores of fate.
the shout a force of the shores of fate.
the shout a force of the shores of fate.
the shout a force of the shores of fate.
the shout a force of the shores of fate.
the shout a force of the shores of fate.
the shout a force of the shores of fate.
the shout a force of the shores of fate.
the shout a force of the shores of fate.
the shout a force of the shores of fate.
the shout a force of the shores of fate.
the shout a forc

In [22]:
from google.colab import files
files.download("checkpoints/lstm_char_best.h5")


## Observations, shortcomings, improvements

My model picks up short patterns such as "the shout" and "the shades" well, so it really does learn how words are spelled and how sentences begin. After a while, however, it keeps repeating the same passage and gets stuck on "the shout a force of the shores of fate".

This happens because I always use argmax during generation. I always pick the most probable character, which makes the text fully deterministic: once the 100-character window contains only the repeated line, the model sees the same input again and produces the same output again.

My conclusion is that the LSTM learns local structure but struggles with long-range dependencies. I could improve the output by using temperature sampling or top-k sampling, which should make it less repetitive and more varied.
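
As a minimal sketch of the suggested improvement, the function below samples the next character index from the softmax output with a temperature and an optional top-k cutoff, instead of taking the argmax. It is written in plain Python so it runs on its own; the function name, the default values and the toy distribution are assumptions for illustration, and I have not run it against the trained model.

```python
import math
import random

def sample_with_temperature(proba, temperature=1.0, top_k=None, rng=random):
    """Hypothetical helper: draw an index from proba after temperature scaling.

    temperature < 1 sharpens the distribution (closer to argmax),
    temperature > 1 flattens it (more random). top_k, if given, must satisfy
    1 <= top_k <= len(proba) and keeps only the k most likely indices
    (ties at the cutoff are all kept).
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # Work in log space; the small constant avoids log(0).
    logits = [math.log(p + 1e-12) / temperature for p in proba]
    if top_k is not None:
        cutoff = sorted(logits)[-top_k]
        logits = [l if l >= cutoff else float("-inf") for l in logits]
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(logits)
    weights = [math.exp(l - peak) for l in logits]
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]

# Toy distribution over three "characters" (made-up values).
toy_proba = [0.7, 0.2, 0.1]
rng = random.Random(0)
greedy_like = [sample_with_temperature(toy_proba, top_k=1, rng=rng) for _ in range(10)]
varied = [sample_with_temperature(toy_proba, temperature=1.0, rng=rng) for _ in range(200)]
print(greedy_like, sorted(set(varied)))
```

In the generation loop, the line `next_idx = int(np.argmax(proba))` could then be replaced by something like `next_idx = sample_with_temperature(proba, temperature=0.8)`, where 0.8 is only an example value that would need tuning.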
