<a href="https://colab.research.google.com/github/AAdewunmi/Next-Word-Prediction-Project/blob/main/Predict_words.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Predict Words!

```markdown
# Predict Words — Notebook README (Quick Guide)

A compact guide to train and use a next-word text generator (Keras LSTM) directly from this notebook.

---

## What this notebook does
- **Preprocesses** a plain-text corpus (*Plato’s The Republic*, public domain) with light normalization.
- **Builds sequences** of tokens for next-word prediction.
- **Trains** a small Keras LSTM language model (default: `Embedding(50) → LSTM(50) → Dense`).
- **Saves & reloads** artifacts for reuse: `nextWord.h5`, `tokenizer.pkl`, `metadata.json`, and `republic_sequences.txt`.
- **Generates text** from a seed using greedy or sampling (temperature / top-k) decoding.

---

## Folder layout & key files
By default (Colab + Drive):
```

/content/drive/MyDrive/Colab Notebooks/Predict-Words-Analysis/
data/
republic.txt
republic_sequences.txt
models/
tokenizer.pkl
metadata.json
nextWordPredict/
nextWord.h5

````
> If you’re running locally, update the `DRIVE_BASE` / `PROJECT_ROOT` path constants in the notebook.

---

## Requirements
- Python 3.10+ (Colab is fine)
- TensorFlow/Keras, NLTK, tqdm, numpy

Install (if needed) and download NLTK data:
```python
!pip -q install tensorflow nltk tqdm
import nltk; nltk.download('punkt'); nltk.download('stopwords')
````

---

## Quick start (Colab)

1. **Mount Drive**

```python
from google.colab import drive
drive.mount('/content/drive')
```

2. **Run cells in order**

* **Setup & utils** → text cleaning, I/O helpers
* **Sequence building** → creates `republic_sequences.txt`
* **Model training** → trains & saves `nextWord.h5`, `tokenizer.pkl`, `metadata.json`
* **Inference** → loads assets and generates text

3. **Generate text** (sampling example)

```python
generated = generate_seq_sampling(
    model, tokenizer, seq_length, seed_text,
    n_words=30, temperature=0.9, top_k=50, repetition_penalty=1.15
)
print(generated)
```

---

## Inputs & outputs

* **Input** corpus: `data/republic.txt` (plain text).
* **Training sequences**: built from the corpus; `seq_length` is inferred from these sequences.
* **Seed text** for inference should match the model’s expected **sequence length − 1** tokens.

  * Use the helper `pick_seed_from_sequences(...)` or supply your own seed (trim to the required length).
* **Output**: continuation of `n_words` predicted tokens.

---

## Default hyperparameters (tunable)

```text
embedding_dim = 50
lstm_units   = 50
batch_size   = 128
epochs       = 50
```

> For better quality, consider increasing `embedding_dim`, `lstm_units`, training data size, and epochs.

---

## Reuse the trained model

Load artifacts and generate without retraining:

```python
model, tokenizer, seq_length, meta = load_nextword_assets(
    model_path=MODEL_PATH, tokenizer_path=TOKENIZER_PATH, seqs_path=SEQS_PATH
)
seed = pick_seed_from_sequences(SEQS_PATH, seq_length)
print(generate_seq_sampling(model, tokenizer, seq_length, seed, n_words=30))
```

---

## Customization tips

* **Different corpus**: replace `republic.txt`, then re-run preprocessing → training.
* **Stricter cleaning**: adjust the normalization pipeline (URLs, HTML, casing, stopwords).
* **Decoding style**: switch between greedy (`generate_seq`) and sampling (`generate_seq_sampling`), play with `temperature` and `top_k`.

---

## Troubleshooting

* **File not found**: confirm Drive is mounted and paths match your environment.
* **Tokenizer/model mismatch**: ensure `tokenizer.pkl` and `nextWord.h5` come from the **same training run**.
* **NLTK resource errors**: run the `nltk.download(...)` lines above.
* **OOM / slow training**: lower `batch_size`, shorten sequences, or use a smaller model.

---

## Attribution

* Text: *Plato — The Republic* (public domain).
* Libraries: TensorFlow/Keras, NLTK, tqdm, numpy.

In [31]:
# Mount google drive

from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [32]:
# Import libraries

import string
import nltk
import re
from nltk.corpus import stopwords
import pkg_resources
import pickle
import json
from tqdm.notebook import tqdm
from nltk.tokenize import word_tokenize

In [33]:
# Import libraries

import numpy
from numpy import array
from pickle import dump
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.utils import to_categorical
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from keras.layers import Embedding

from random import randint
from pickle import load
from keras.models import load_model
from keras.preprocessing.sequence import pad_sequences

In [34]:
# Add utility functions for document handling and sample data directory detection

# --- Drive-backed, project-scoped document & model utilities ---
from pathlib import Path
from typing import List, Union, Tuple, Optional
import string
import json
import pickle
import platform
import time

# 0) Drive mount (must be done in a previous cell)
DRIVE_MOUNT = Path("/content/drive")
DRIVE_MYDRIVE = DRIVE_MOUNT / "MyDrive"

def _require_drive() -> None:
    """Raise if Google Drive isn't mounted in Colab."""
    if not DRIVE_MYDRIVE.exists():
        raise RuntimeError(
            "Google Drive is not mounted at /content/drive. "
            "Run: from google.colab import drive; drive.mount('/content/drive')"
        )

def _project_root_dir() -> Path:
    """
    Project root inside MyDrive.
    Uses the exact folder you specified:
      MyDrive/Colab Notebooks/Predict-Words-Analysis
    """
    _require_drive()
    root = DRIVE_MYDRIVE / "Colab Notebooks" / "Predict-Words-Analysis"
    root.mkdir(parents=True, exist_ok=True)
    return root

# 1) Project-scoped directories
PROJECT_ROOT = _project_root_dir()
DATA_DIR     = PROJECT_ROOT / "data"                 # *.txt
MODELS_DIR   = PROJECT_ROOT / "models"               # *.pkl, metadata.json
NEXTWORD_DIR = MODELS_DIR / "nextWordPredict"        # *.keras / *.h5
for d in (DATA_DIR, MODELS_DIR, NEXTWORD_DIR):
    d.mkdir(parents=True, exist_ok=True)

def _resolve_path(filename: Union[str, Path]) -> Path:
    """
    Resolve a filename into the correct project folder based on extension:
      *.txt   -> DATA_DIR
      *.pkl   -> MODELS_DIR
      *.keras/*.h5 -> NEXTWORD_DIR
      otherwise -> PROJECT_ROOT
    Absolute paths are returned as-is.
    """
    p = Path(filename)
    if p.is_absolute():
        return p
    suffix = p.suffix.lower()
    if suffix == ".txt":
        return DATA_DIR / p.name
    if suffix == ".pkl":
        return MODELS_DIR / p.name
    if suffix in (".keras", ".h5"):
        return NEXTWORD_DIR / p.name
    return PROJECT_ROOT / p.name

# 2) Document I/O
def load_doc(filename: str) -> str:
    """
    Read a UTF-8 text file.
    Relative names go to .../Predict-Words-Analysis/data/.
    """
    path = _resolve_path(filename)
    with path.open("r", encoding="utf-8") as f:
        return f.read()

def clean_doc(doc: str) -> List[str]:
    """
    Convert raw document text into cleaned, lowercased, alphabetic tokens.

    Steps:
      1) Replace double hyphens with a space.
      2) Split on whitespace.
      3) Remove ASCII punctuation from each token.
      4) Keep only purely alphabetic tokens (isalpha()).
      5) Lowercase all tokens.
    """
    doc = doc.replace("--", " ")
    table = str.maketrans("", "", string.punctuation)
    tokens = [w.translate(table) for w in doc.split()]
    tokens = [w.lower() for w in tokens if w.isalpha()]
    return tokens

def save_doc(lines: List[str], filename: str) -> None:
    """
    Save a list of strings to disk, one per line (UTF-8).
    Relative *.txt files go to .../data/.
    """
    path = _resolve_path(filename)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as f:
        f.write("\n".join(lines))

# 3) Optional: model/tokenizer persistence in your structure
def persist_nextword_assets(model, tokenizer, *, model_name: str = "nextWord") -> Tuple[Path, Path, Path]:
    """
    Save model (.keras) to models/nextWordPredict/, tokenizer (.pkl) and metadata.json to models/.
    Returns (model_path, tokenizer_path, metadata_path).
    """
    # Defer import so this file doesn't require TF unless you call this
    from tensorflow.keras.models import load_model as _  # noqa: F401

    NEXTWORD_DIR.mkdir(parents=True, exist_ok=True)
    MODELS_DIR.mkdir(parents=True, exist_ok=True)

    model_path = NEXTWORD_DIR / f"{model_name}.keras"
    tokenizer_path = MODELS_DIR / "tokenizer.pkl"
    metadata_path = MODELS_DIR / "metadata.json"

    model.save(model_path)
    with tokenizer_path.open("wb") as f:
        pickle.dump(tokenizer, f)

    # Try to infer seq_length from model.input_shape[1] if present
    seq_len = None
    try:
        ish = getattr(model, "input_shape", None)
        if isinstance(ish, (list, tuple)) and len(ish) >= 2 and isinstance(ish[1], int):
            seq_len = int(ish[1])
    except Exception:
        pass

    meta = {
        "created_at": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "seq_length": seq_len,
        "vocab_size": len(getattr(tokenizer, "word_index", {})) + 1,
        "python_version": platform.python_version(),
    }
    with metadata_path.open("w", encoding="utf-8") as f:
        json.dump(meta, f, ensure_ascii=False, indent=2)

    return model_path, tokenizer_path, metadata_path

def load_nextword_assets(model_name: str = "nextWord") -> Tuple[object, object, dict]:
    """
    Load (model, tokenizer, metadata) from:
      models/nextWordPredict/<model_name>.h5
      models/tokenizer.pkl
      models/metadata.json
    """
    from tensorflow.keras.models import load_model
    model_path = NEXTWORD_DIR / f"{model_name}.h5"
    tokenizer_path = MODELS_DIR / "tokenizer.pkl"
    metadata_path = MODELS_DIR / "metadata.json"

    if not model_path.exists():
        raise FileNotFoundError(f"Model not found: {model_path}")
    if not tokenizer_path.exists():
        raise FileNotFoundError(f"Tokenizer not found: {tokenizer_path}")

    model = load_model(model_path, compile=False)
    with tokenizer_path.open("rb") as f:
        tokenizer = pickle.load(f)
    meta = {}
    if metadata_path.exists():
        with metadata_path.open("r", encoding="utf-8") as f:
            meta = json.load(f)
    return model, tokenizer, meta



In [35]:
# Add script to tokenize text file from sample_data and save tokens

INPUT_FILE = "republic.txt"
OUTPUT_FILE = "republic-tokenised.txt"

# Execute the pipeline
text = load_doc(INPUT_FILE)
tokens = clean_doc(text)
save_doc(tokens, OUTPUT_FILE)

print(f"Read from: {INPUT_FILE}")
print(f"Wrote  to: {OUTPUT_FILE}")
print(f"Sample tokens: {tokens[:25]}")
print(f"Total tokens: {len(tokens):,}")



Read from: republic.txt
Wrote  to: republic-tokenised.txt
Sample tokens: ['the', 'project', 'gutenberg', 'ebook', 'of', 'the', 'republic', 'by', 'plato', 'this', 'ebook', 'is', 'for', 'the', 'use', 'of', 'anyone', 'anywhere', 'in', 'the', 'united', 'states', 'and', 'most', 'other']
Total tokens: 209,695


In [36]:
"""
Cell — Sanity checks for Drive-backed document pipeline
-------------------------------------------------------
Verifies:
  • Google Drive is mounted.
  • `text` and `tokens` look sane.
  • Output file exists at the resolved Drive path.
  • save/load round-trip works in the project data folder.
Assumes you've already run the Drive-backed helpers (with _require_drive/_resolve_path).
"""

from pathlib import Path

# Ensure Drive is mounted and project folders exist
_require_drive()

# Basic object checks (expects you already computed `text`, `tokens`, and set `OUTPUT_FILE`)
assert isinstance(text, str) and text.strip(), "Input text is empty or not a string."
assert isinstance(tokens, list) and all(isinstance(t, str) for t in tokens), "Tokens must be a list of strings."
assert all(t.isalpha() for t in tokens), "Non-alphabetic tokens slipped through."

# Output file must exist in Drive (resolve relative names into your project layout)
out_path = _resolve_path(OUTPUT_FILE)
assert out_path.exists(), f"Output file was not written to Drive: {out_path}"

# Round-trip write/read in Drive data folder
tmp_out = _resolve_path("_tmp_tokens.txt")   # goes to .../data/_tmp_tokens.txt
save_doc(["A", "b", "c"], tmp_out)
reloaded = load_doc(tmp_out).splitlines()
assert reloaded == ["A", "b", "c"], "save_doc/load_doc round-trip failed."
tmp_out.unlink(missing_ok=True)

print("Sanity checks (Drive) passed.")



Sanity checks (Drive) passed.


In [37]:
# Text I/O and Cleaning Utilities
# ========================================

# Defines reusable helpers for:
  # • File I/O for plain-text corpora (`load_doc`, `save_doc`)
  # • Document tokenization for large files (`clean_doc`)
  # • Social-text normalization for short messages (tweets/posts) via
    # explicit, testable steps (`strip_html`, `strip_urls`, `strip_emails`,
    # `keep_letters_only`, `remove_roman_numerals`, `normalize_whitespace`,
    # high-level `clean_social_text`, corpus-level `clean_social_corpus`,
    # and a simple whitespace tokenizer `tokenize_simple`)
  # • Environment detection for a writable sample-data directory (`SAMPLE_DIR`)
    # supporting both local Jupyter and Google Colab patterns.

# --- Social text cleaning helpers (fits alongside load_doc / clean_doc / save_doc) ---
import re
from typing import List, Iterable

# Optional progress bar; falls back to a no-op if tqdm isn't available
try:
    from tqdm.auto import tqdm  # type: ignore
except Exception:  # pragma: no cover
    def tqdm(x):  # type: ignore
        return x

# Pre-compile patterns once
_HTML_TAGS_RE   = re.compile(r"<.*?>")
_URL_RE         = re.compile(r"https?://\S+|www\.\S+", flags=re.IGNORECASE)
_EMAIL_RE = re.compile(
    r'\b(?:mailto:)?(?:at\s+)?[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b',
    flags=re.IGNORECASE,
)
_NON_LETTERS_RE = re.compile(r"[^A-Za-z]+")   # ASCII letters only; see note below
_ROMAN_RE       = re.compile(r"\b[MDCLXVI]+\b\.?", flags=re.IGNORECASE)
_WW_RE          = re.compile(r"ww+", flags=re.IGNORECASE)  # catch stray 'www' fragments
_WS_RE          = re.compile(r"\s+")

def strip_html(text: str) -> str:
    """
    Remove HTML tags from text.
    """
    return _HTML_TAGS_RE.sub("", text)

def strip_urls(text: str) -> str:
    """
    Remove URLs (http[s]:// and bare www.*) from text.
    """
    return _URL_RE.sub("", text)

def strip_emails(text: str) -> str:
    """
    Remove email addresses and a preceding 'at ' if present.

    Examples:
        'Email me at jane@x.com'  -> 'Email me'
        'Contact jane@x.com now'  -> 'Contact now'
    """
    return _EMAIL_RE.sub("", text)


def remove_roman_numerals(text: str) -> str:
    """
    Remove standalone Roman numerals (I, IV, XIV, etc.), optionally with trailing period.
    """
    return _ROMAN_RE.sub("", text)


def keep_letters_only(text: str) -> str:
    """
    Replace any non-letter character with a space (A–Z only).
    Note: this strips digits, punctuation, emojis, and diacritics.
    """
    return _NON_LETTERS_RE.sub(" ", text)

def normalize_whitespace(text: str) -> str:
    """
    Collapse multiple spaces/newlines to a single space and trim edges.
    """
    return _WS_RE.sub(" ", text).strip()

def clean_social_text(text: str, *, letters_only: bool = True, lowercase: bool = True) -> str:
    """
    Clean a single social post/message.

    Pipeline:
      1) Strip HTML tags
      2) Remove URLs
      3) Remove emails
      4) (Optional) keep only letters (A-Z), replacing others with spaces
      5) Lowercase
      6) Remove 'www' fragments and standalone Roman numerals
      7) Normalize whitespace

    Args:
        text: Raw input text.
        letters_only: If True, drop non-letters (digits, punctuation, emojis).
        lowercase: If True, lowercase the text.

    Returns:
        Cleaned text as a single string.
    """
    if text is None:
        return ""

    x = strip_html(text)
    x = strip_urls(x)
    x = strip_emails(x)
    if letters_only:
        x = keep_letters_only(x)
    if lowercase:
        x = x.lower()
    # Misc cleanups mirroring your original intent
    x = _WW_RE.sub("", x)           # remove leftover www/ww fragments
    x = remove_roman_numerals(x)    # drop roman numerals like 'XIV'
    x = normalize_whitespace(x)
    return x

def tokenize_simple(text: str) -> List[str]:
    """
    Basic whitespace tokenizer for already-cleaned text.
    """
    if not text:
        return []
    return text.split()

def clean_social_corpus(
    texts: Iterable[str],
    *,
    to_tokens: bool = False,
    show_progress: bool = True,
    letters_only: bool = True,
    lowercase: bool = True,
) -> List[List[str]] | List[str]:
    """
    Clean a collection of social texts and optionally tokenize.

    Args:
        texts: Iterable of raw texts (e.g., tweets, comments).
        to_tokens: If True, return List[List[str]] (tokens per text). If False, return cleaned strings.
        show_progress: If True, show a progress bar when tqdm is available.
        letters_only: Keep only letters (A-Z) before tokenization.
        lowercase: Lowercase text before tokenization.

    Returns:
        If to_tokens is False: List[str] of cleaned strings.
        If to_tokens is True:  List[List[str]] of tokenized strings per input text.
    """
    it = tqdm(texts) if show_progress else texts
    if to_tokens:
        return [tokenize_simple(clean_social_text(t, letters_only=letters_only, lowercase=lowercase)) for t in it]
    else:
        return [clean_social_text(t, letters_only=letters_only, lowercase=lowercase) for t in it]



In [38]:
# Example corpus (replace with your own list of tweets/messages)

raw_texts = [
    "<p>Check this out: https://example.com GREAT DEAL!!!</p>",
    "Email me at John.Doe@example.org or visit www.mysite.org",
    "We met on XIV. It was fun :)",
    "Hello—World! New\nline\tand\ttabs.",
]

cleaned = clean_social_corpus(raw_texts, to_tokens=False, show_progress=False)
tokenized = clean_social_corpus(raw_texts, to_tokens=True, show_progress=False)

print("Cleaned strings:")
for s in cleaned:
    print("  ", s)

print("\nTokenized (per text):")
for toks in tokenized:
    print("  ", toks)


Cleaned strings:
   check this out great deal
   email me or visit
   we met on it was fun
   hello world new line and tabs

Tokenized (per text):
   ['check', 'this', 'out', 'great', 'deal']
   ['email', 'me', 'or', 'visit']
   ['we', 'met', 'on', 'it', 'was', 'fun']
   ['hello', 'world', 'new', 'line', 'and', 'tabs']


In [39]:
# Sanity tests to catch regressions quickly

def _assert_equal(a, b, msg=""):
    assert a == b, f"{msg}\nExpected: {b}\nActual:   {a}"

# 1) URL & HTML stripping
sample1 = "<b>Deal</b> at https://x.y/z and www.foo.com"
out1 = clean_social_text(sample1)
_assert_equal(out1, "deal at and", "URL/HTML removal failed")

# 2) Email removal
sample2 = "Contact a@b.co now! or A.B-c_d@domain.io later."
out2 = clean_social_text(sample2)
_assert_equal(out2, "contact now or later", "Email removal failed")

# 3) Roman numerals dropping
sample3 = "This is Chapter XIV. And Section vi."
out3 = clean_social_text(sample3)
_assert_equal(out3, "this is chapter and section", "Roman numeral removal failed")

# 4) Letters-only + whitespace normalization
sample4 = "Hello—World! New\nline\tand\ttabs. #hashtag 123"
out4 = clean_social_text(sample4)
_assert_equal(out4, "hello world new line and tabs hashtag", "Letters-only/whitespace failed")

# 5) Corpus path (clean strings)
raws = ["Email me: joe@x.com", "Visit <i>www.example.com</i> TODAY!!"]
cleaned = clean_social_corpus(raws, to_tokens=False, show_progress=False)
_assert_equal(cleaned, ["email me", "visit today"], "Corpus cleaning failed")

# 6) Corpus path (tokens)
tokenized = clean_social_corpus(raws, to_tokens=True, show_progress=False)
_assert_equal(tokenized, [["email", "me"], ["visit", "today"]], "Corpus tokenization failed")

print("All social-text cleaning tests passed.")


All social-text cleaning tests passed.


In [40]:
# Implement littleCleaning function to filter sentences by length

def littleCleaning(sentences):
    print("Starting cleaning Process")
    ret_list = []
    for sentence in sentences:
      words = sentence.split(" ")
      if len(words) > 5:
        ret_list.append(sentence)
      else:
        continue
    return ret_list

In [41]:
# Download necessary NLTK data files (wordnet, punkt)

nltk.download('wordnet')

nltk.download('punkt')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [42]:
# Load and preprocess 'republic.txt' corpus

# Uses the Drive-backed helpers set up earlier.

text = load_doc("republic.txt").lower()   # resolves to /content/drive/MyDrive/Colab Notebooks/Predict-Words-Analysis/data/republic.txt
print(f"length of the corpus: {len(text):,}")

length of the corpus: 1,174,387


In [43]:
# Converting the data into lists

data_list = text.split(".")
data_list[:20]

['the project gutenberg ebook of the republic, by plato\n\nthis ebook is for the use of anyone anywhere in the united states and\nmost other parts of the world at no cost and with almost no restrictions\nwhatsoever',
 ' you may copy it, give it away or re-use it under the terms\nof the project gutenberg license included with this ebook or online at\nwww',
 'gutenberg',
 'org',
 ' if you are not located in the united states, you\nwill have to check the laws of the country where you are located before\nusing this ebook',
 '\n\ntitle: the republic\n\nauthor: plato\n\ntranslator: b',
 ' jowett\n\nrelease date: october, 1998 [ebook #1497]\n[most recently updated: september 11, 2021]\n\nlanguage: english\n\n\nproduced by: sue asscher and david widger\n\n*** start of the project gutenberg ebook the republic ***\n\n\n\n\nthe republic\n\nby plato\n\ntranslated by benjamin jowett\n\nnote: see also “the republic” by plato, jowett, ebook #150\n\n\ncontents\n\n introduction and analysis',
 '\n the 

In [44]:
# --- Normalization pipeline that uses social-text utilities ---

from typing import Callable, Iterable, List, Union

def normalization_pipeline(
    texts: Iterable[str],
    *,
    to_tokens: bool = False,
    postprocess: Callable[[List[str]], List[str]] | None = None,
    show_progress: bool = True,
    letters_only: bool = True,
    lowercase: bool = True,
) -> Union[List[str], List[List[str]]]:
    """
    Normalize a collection of short texts using Cell 1 social-text utilities.

    - Uses `clean_social_corpus` for HTML/URL/email stripping, letters-only, lowercasing,
      roman-numeral removal, and whitespace normalization.
    - Returns strings by default (`to_tokens=False`) or tokens (`to_tokens=True`).
    - Will only apply `postprocess` if you pass it explicitly.
    """
    print("Starting Normalization Process")
    cleaned_or_tokens = clean_social_corpus(
        texts,
        to_tokens=to_tokens,
        show_progress=show_progress,
        letters_only=letters_only,
        lowercase=lowercase,
    )

    # Only apply postprocess if explicitly provided
    if callable(postprocess):
        if to_tokens:
            # If your postprocess expects strings, join first.
            try:
                joined = [" ".join(toks) for toks in cleaned_or_tokens]  # type: ignore[arg-type]
                maybe = postprocess(joined)
                cleaned_or_tokens = maybe if maybe is not None else joined  # type: ignore[assignment]
            except Exception as e:
                raise TypeError(
                    "Postprocess failed on tokenized data. "
                    "Provide a postprocess that accepts List[List[str]] or join tokens yourself."
                ) from e
        else:
            maybe = postprocess(cleaned_or_tokens)  # type: ignore[arg-type]
            # Guard against in-place functions that return None
            if maybe is not None:
                cleaned_or_tokens = maybe  # type: ignore[assignment]

    print("Normalization Process Finished")
    return cleaned_or_tokens



In [45]:
# pro_sentences: list of cleaned strings (default, matches your previous pipeline)

pro_sentences = normalization_pipeline(
    data_list,         # your existing list of raw texts
    to_tokens=False,   # keep strings to stay compatible with littleCleaning
    show_progress=False
)

pro_sentences[:5]

Starting Normalization Process
Normalization Process Finished


['the project gutenberg ebook of the republic by plato this ebook is for the use of anyone anywhere in the united states and most other parts of the world at no cost and with almost no restrictions whatsoever',
 'you may copy it give it away or re use it under the terms of the project gutenberg license included with this ebook or online at',
 'gutenberg',
 'org',
 'if you are not located in the united states you will have to check the laws of the country where you are located before using this ebook']

In [46]:
# Tokenize and preprocess data list

pro_tokens = normalization_pipeline(
    data_list,
    to_tokens=True,    # returns List[List[str]]
    show_progress=False
)
pro_tokens[:2]

Starting Normalization Process
Normalization Process Finished


[['the',
  'project',
  'gutenberg',
  'ebook',
  'of',
  'the',
  'republic',
  'by',
  'plato',
  'this',
  'ebook',
  'is',
  'for',
  'the',
  'use',
  'of',
  'anyone',
  'anywhere',
  'in',
  'the',
  'united',
  'states',
  'and',
  'most',
  'other',
  'parts',
  'of',
  'the',
  'world',
  'at',
  'no',
  'cost',
  'and',
  'with',
  'almost',
  'no',
  'restrictions',
  'whatsoever'],
 ['you',
  'may',
  'copy',
  'it',
  'give',
  'it',
  'away',
  'or',
  're',
  'use',
  'it',
  'under',
  'the',
  'terms',
  'of',
  'the',
  'project',
  'gutenberg',
  'license',
  'included',
  'with',
  'this',
  'ebook',
  'or',
  'online',
  'at']]

In [47]:
# Add unit tests for normalization_pipeline

def _assert_equal(a, b, msg=""):
    assert a == b, f"{msg}\nExpected: {b}\nActual:   {a}"

_demo = [
    "<b>Deal</b> at https://x.y/z and www.foo.com #promo",
    "Email me at Jane.Doe@example.org ASAP — thanks!",
]

# Strings out
out = normalization_pipeline(_demo, to_tokens=False, show_progress=False)
_assert_equal(out, ["deal at and promo", "email me asap thanks"], "String normalization failed")

# Tokens out
out_tok = normalization_pipeline(_demo, to_tokens=True, show_progress=False)
_assert_equal(out_tok, [["deal", "at", "and", "promo"], ["email", "me", "asap", "thanks"]], "Token normalization failed")

print("Normalization pipeline tests passed.")





Starting Normalization Process
Normalization Process Finished
Starting Normalization Process
Normalization Process Finished
Normalization pipeline tests passed.


In [48]:
# Check processed sentence count

len(pro_sentences)

7012

In [49]:
# Structuring the text into a paragraph

dataText = "".join(pro_sentences[: 700])
dataText[: 200]

'the project gutenberg ebook of the republic by plato this ebook is for the use of anyone anywhere in the united states and most other parts of the world at no cost and with almost no restrictions what'

In [50]:
# turn a doc into clean tokens

def clean_doc(doc):
    # replace '--' with a space ' '
    doc = doc.replace('--', ' ')
    # split into tokens by white space
    tokens = doc.split()
    # remove punctuation from each token
    table = str.maketrans('', '', string.punctuation)
    tokens = [w.translate(table) for w in tokens]
    # remove remaining tokens that are not alphabetic
    tokens = [word for word in tokens if word.isalpha()]
    # make lower case
    tokens = [word.lower() for word in tokens]
    return tokens

In [51]:
# Tokenize and analyze corpus statistics

tokens = clean_doc(dataText)
print(tokens[:200])
print('Total Tokens: %d' % len(tokens))
print('Unique Tokens: %d' % len(set(tokens)))

['the', 'project', 'gutenberg', 'ebook', 'of', 'the', 'republic', 'by', 'plato', 'this', 'ebook', 'is', 'for', 'the', 'use', 'of', 'anyone', 'anywhere', 'in', 'the', 'united', 'states', 'and', 'most', 'other', 'parts', 'of', 'the', 'world', 'at', 'no', 'cost', 'and', 'with', 'almost', 'no', 'restrictions', 'whatsoeveryou', 'may', 'copy', 'it', 'give', 'it', 'away', 'or', 're', 'use', 'it', 'under', 'the', 'terms', 'of', 'the', 'project', 'gutenberg', 'license', 'included', 'with', 'this', 'ebook', 'or', 'online', 'atgutenbergorgif', 'you', 'are', 'not', 'located', 'in', 'the', 'united', 'states', 'you', 'will', 'have', 'to', 'check', 'the', 'laws', 'of', 'the', 'country', 'where', 'you', 'are', 'located', 'before', 'using', 'this', 'ebooktitle', 'the', 'republic', 'author', 'plato', 'translator', 'bjowett', 'release', 'date', 'october', 'ebook', 'most', 'recently', 'updated', 'september', 'language', 'english', 'produced', 'by', 'sue', 'asscher', 'and', 'david', 'widger', 'start', 'of'

In [52]:
# Implement sequence creation for language modeling

length = 50 + 1
sequences = list()
for i in range(length, len(tokens)):
    # select sequence of tokens
    seq = tokens[i-length:i]
    # convert into a line
    line = ' '.join(seq)
    # store
    sequences.append(line)
print('Total Sequences: %d' % len(sequences))

Total Sequences: 18332


In [53]:
# Implement utility function save_doc for writing sequences to file

# --- Persist training sequences to Drive using the project helpers ---

# If sequences are already strings, this is a no-op; if they are lists/tuples of tokens,
# we join them into space-separated lines.
lines = [
    " ".join(seq) if isinstance(seq, (list, tuple)) else str(seq)
    for seq in sequences
]

OUTPUT_SEQS = "republic_sequences.txt"   # goes to .../MyDrive/Colab Notebooks/Predict-Words-Analysis/data/
save_doc(lines, OUTPUT_SEQS)

print("Wrote:", _resolve_path(OUTPUT_SEQS))

Wrote: /content/drive/MyDrive/Colab Notebooks/Predict-Words-Analysis/data/republic_sequences.txt


In [54]:
# Verify file exists and peek a couple of lines

p = _resolve_path(OUTPUT_SEQS)
assert p.exists(), f"Expected file at {p}"
preview = load_doc(OUTPUT_SEQS).splitlines()[:2]
print("Preview:", preview)

Preview: ['the project gutenberg ebook of the republic by plato this ebook is for the use of anyone anywhere in the united states and most other parts of the world at no cost and with almost no restrictions whatsoeveryou may copy it give it away or re use it under the terms', 'project gutenberg ebook of the republic by plato this ebook is for the use of anyone anywhere in the united states and most other parts of the world at no cost and with almost no restrictions whatsoeveryou may copy it give it away or re use it under the terms of']


In [55]:
# Implement data preparation and tokenization pipeline.

# Consolidated Imports

import numpy
from numpy import array
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.models import Sequential # Use tensorflow namespace
from tensorflow.keras.layers import Dense, LSTM, Embedding # Use tensorflow namespace

in_filename = 'republic_sequences.txt'
doc = load_doc(in_filename)
lines = doc.split('\n')

# integer encode sequences of words
tokenizer = Tokenizer()
tokenizer.fit_on_texts(lines)
sequences = tokenizer.texts_to_sequences(lines)
# vocabulary size
vocab_size = len(tokenizer.word_index) + 1

# separate into input and output
sequences = array(sequences)
X, y = sequences[:,:-1], sequences[:,-1]
y = to_categorical(y, num_classes=vocab_size)
seq_length = X.shape[1]

In [56]:
#

from pathlib import Path
p = Path(in_filename) if in_filename.startswith("/content/") else _resolve_path(in_filename)
assert p.exists(), f"File not found: {p}"


In [57]:
"""
Cell — Train LSTM next-word model and persist artifacts to Google Drive
----------------------------------------------------------------------
Prereqs:
  • You have already computed: X (np.ndarray), y (np.ndarray one-hot),
    seq_length (int), vocab_size (int), tokenizer (fitted Keras Tokenizer).
  • Drive is mounted:
        from google.colab import drive
        drive.mount('/content/drive')

Outputs (Drive):
  • Model (.keras):  MyDrive/Colab Notebooks/Predict-Words-Analysis/models/nextWordPredict/nextWord.keras
  • Tokenizer (.pkl): MyDrive/Colab Notebooks/Predict-Words-Analysis/models/tokenizer.pkl
  • Metadata (.json):  MyDrive/Colab Notebooks/Predict-Words-Analysis/models/metadata.json
"""

from pathlib import Path
import json, pickle, time, platform

import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, LSTM, Embedding
from tensorflow.keras.utils import to_categorical

# --------- Preflight checks (fail fast with clear errors) ----------
required = {
    "X": "numpy.ndarray of shape (n_samples, seq_length)",
    "y": "numpy.ndarray one-hot of shape (n_samples, vocab_size)",
    "seq_length": "int (timesteps used during training)",
    "vocab_size": "int (len(tokenizer.word_index)+1)",
    "tokenizer": "fitted keras.preprocessing.text.Tokenizer",
}
for name in required:
    if name not in globals():
        raise RuntimeError(f"Missing variable `{name}`. Expected: {required[name]}")
if not isinstance(seq_length, int) or seq_length <= 0:
    raise ValueError(f"Bad seq_length: {seq_length}")
if not isinstance(vocab_size, int) or vocab_size <= 1:
    raise ValueError(f"Bad vocab_size: {vocab_size}")
if not hasattr(tokenizer, "word_index"):
    raise TypeError("`tokenizer` doesn’t look like a fitted Keras Tokenizer.")

# Optional additional shape checks
assert X.ndim == 2 and X.shape[1] == seq_length, f"X shape mismatch: {X.shape}, seq_length={seq_length}"
assert y.ndim == 2 and y.shape[1] == vocab_size, f"y shape mismatch: {y.shape}, vocab_size={vocab_size}"

# --------- Model definition (matches your architecture) ----------
embedding_dim = 50
lstm_units = 50
dense_units = 50

model = Sequential()
model.add(Embedding(vocab_size, embedding_dim, input_length=seq_length))
model.add(LSTM(lstm_units, return_sequences=True))
model.add(LSTM(lstm_units))
model.add(Dense(dense_units, activation='relu'))
model.add(Dense(vocab_size, activation='softmax'))

# Ensure a concrete input shape (optional)
model.build(input_shape=(None, seq_length))
print(model.summary())

# --------- Compile & Train ----------
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

batch_size = 128
epochs = 50
history = model.fit(X, y, batch_size=batch_size, epochs=epochs)

# --------- Persist artifacts to Google Drive ----------
drive_root   = Path("/content/drive/MyDrive")
project_root = drive_root / "Colab Notebooks" / "Predict-Words-Analysis"
models_dir   = project_root / "models"
nw_dir       = models_dir / "nextWordPredict"

nw_dir.mkdir(parents=True, exist_ok=True)
models_dir.mkdir(parents=True, exist_ok=True)

model_path     = nw_dir / "nextWord.h5"
tokenizer_path = models_dir / "tokenizer.pkl"
metadata_path  = models_dir / "metadata.json"

# Save model
model.save(model_path)

# Save tokenizer (pickle)
with tokenizer_path.open("wb") as f:
    pickle.dump(tokenizer, f)

# Save minimal metadata for reproducibility
meta = {
    "created_at_utc": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
    "seq_length": int(seq_length),
    "vocab_size": int(vocab_size),
    "embedding_dim": int(embedding_dim),
    "lstm_units": int(lstm_units),
    "dense_units": int(dense_units),
    "python_version": platform.python_version(),
}
with metadata_path.open("w", encoding="utf-8") as f:
    json.dump(meta, f, ensure_ascii=False, indent=2)

# --------- Sanity: existence + quick reload test (lightweight) ----------
assert model_path.exists(), f"Model not saved: {model_path}"
assert tokenizer_path.exists(), f"Tokenizer not saved: {tokenizer_path}"
assert metadata_path.exists(), f"Metadata not saved: {metadata_path}"

# Optional: quick load to ensure files aren’t corrupt
from tensorflow.keras.models import load_model as _load_model
_ = _load_model(model_path, compile=False)  # model reload sanity
with tokenizer_path.open("rb") as f:
    _tok = pickle.load(f)
assert len(getattr(_tok, "word_index", {})) == len(tokenizer.word_index), "Tokenizer mismatch on reload."

print("Saved:")
print("  Model    :", model_path)
print("  Tokenizer:", tokenizer_path)
print("  Metadata :", metadata_path)




None
Epoch 1/50
[1m144/144[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m21s[0m 124ms/step - accuracy: 0.0641 - loss: 7.2641
Epoch 2/50
[1m144/144[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m14s[0m 100ms/step - accuracy: 0.0858 - loss: 6.1945
Epoch 3/50
[1m144/144[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m14s[0m 98ms/step - accuracy: 0.0841 - loss: 6.0787
Epoch 4/50
[1m144/144[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m23s[0m 113ms/step - accuracy: 0.1155 - loss: 5.9815
Epoch 5/50
[1m144/144[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m14s[0m 98ms/step - accuracy: 0.1234 - loss: 5.8724
Epoch 6/50
[1m144/144[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m14s[0m 99ms/step - accuracy: 0.1267 - loss: 5.7721
Epoch 7/50
[1m144/144[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m14s[0m 98ms/step - accuracy: 0.1318 - loss: 5.5946
Epoch 8/50
[1m144/144[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m20s[0m 97ms/step - accuracy: 0.1379 - loss: 5.4837
Epoch 9/50
[1m1



Saved:
  Model    : /content/drive/MyDrive/Colab Notebooks/Predict-Words-Analysis/models/nextWordPredict/nextWord.h5
  Tokenizer: /content/drive/MyDrive/Colab Notebooks/Predict-Words-Analysis/models/tokenizer.pkl
  Metadata : /content/drive/MyDrive/Colab Notebooks/Predict-Words-Analysis/models/metadata.json


In [58]:
# Implement text generation function using a trained Keras model

import numpy as np

def generate_seq(model, tokenizer, seq_length, seed_text, n_words):
    result = list()
    in_text = seed_text
    # generate a fixed number of words
    for _ in range(n_words):
        # encode the text as integer
        encoded = tokenizer.texts_to_sequences([in_text])[0]
        # truncate sequences to a fixed length
        encoded = pad_sequences([encoded], maxlen=seq_length, truncating='pre')
        # predict probabilities for each word
        # yhat = model.predict_classes(encoded, verbose=0)
        predict_x=model.predict(encoded)
        yhat=np.argmax(predict_x,axis=1)
        # map predicted word index to word
        out_word = ''
        for word, index in tokenizer.word_index.items():
            if index == yhat:
                out_word = word
                break
        # append to input
        in_text += ' ' + out_word
        result.append(out_word)
    return ' '.join(result)

In [59]:
# Load text sequences from file and determine sequence length

in_filename = 'republic_sequences.txt'
doc = load_doc(in_filename)
lines = doc.split('\n')
seq_length = len(lines[0].split()) - 1

print(len(lines))
print(lines[0])

18332
the project gutenberg ebook of the republic by plato this ebook is for the use of anyone anywhere in the united states and most other parts of the world at no cost and with almost no restrictions whatsoeveryou may copy it give it away or re use it under the terms


In [60]:
"""
Cell — Load next-word model/tokenizer from Google Drive (MyDrive) and generate text
-----------------------------------------------------------------------------------
Assumes:
  • Drive is mounted:
        from google.colab import drive
        drive.mount('/content/drive')
  • Files are stored under:
        /content/drive/MyDrive/Colab Notebooks/Predict-Words-Analysis/data/republic_sequences.txt
        /content/drive/MyDrive/Colab Notebooks/Predict-Words-Analysis/models/tokenizer.pkl
        /content/drive/MyDrive/Colab Notebooks/Predict-Words-Analysis/models/nextWordPredict/nextWord.h5
"""

from pathlib import Path
import pickle
import numpy as np
from tensorflow.keras.models import load_model
from tensorflow.keras.preprocessing.sequence import pad_sequences

# ---- Project-scoped Drive paths ----
DRIVE_BASE   = Path("/content/drive/MyDrive")
PROJECT_ROOT = DRIVE_BASE / "Colab Notebooks" / "Predict-Words-Analysis"
DATA_DIR     = PROJECT_ROOT / "data"
MODELS_DIR   = PROJECT_ROOT / "models"
NEXTWORD_DIR = MODELS_DIR / "nextWordPredict"

MODEL_PATH     = NEXTWORD_DIR / "nextWord.h5"
TOKENIZER_PATH = MODELS_DIR   / "tokenizer.pkl"
SEQS_PATH      = DATA_DIR     / "republic_sequences.txt"

# ---- Preflight: ensure Drive + files exist ----
assert DRIVE_BASE.exists(), (
    "Google Drive not mounted at /content/drive/MyDrive. "
    "Run: from google.colab import drive; drive.mount('/content/drive')"
)
for p in (MODEL_PATH, TOKENIZER_PATH, SEQS_PATH):
    assert p.exists(), f"Missing required file: {p}"

def load_assets(model_path: Path, tokenizer_path: Path):
    """
    Load the trained next-word model and its matching tokenizer.

    Args:
        model_path: Absolute path to the .keras model file in Drive.
        tokenizer_path: Absolute path to the pickled Keras Tokenizer.

    Returns:
        (model, tokenizer)
    """
    model = load_model(model_path, compile=False)
    with tokenizer_path.open("rb") as f:
        tokenizer = pickle.load(f)
    return model, tokenizer

def infer_seq_length(model=None, sequences_path: Path | None = None) -> int:
    """
    Infer the training sequence length.

    Priority:
      1) model.input_shape[1] if present
      2) mode(line_length) - 1 from the sequences file (lines are usually seq_length+1)
    """
    # 1) From model
    if model is not None and isinstance(model.input_shape, (list, tuple)):
        if len(model.input_shape) >= 2 and isinstance(model.input_shape[1], int):
            return int(model.input_shape[1])

    # 2) From sequences file
    if sequences_path is None:
        raise ValueError("Need a model with input_shape or a sequences file to infer seq_length.")
    lengths = []
    with sequences_path.open("r", encoding="utf-8") as f:
        for i, line in enumerate(f):
            line = line.strip()
            if not line:
                continue
            lengths.append(len(line.split()))
            if i > 5000:  # sample is enough to get the mode
                break
    if not lengths:
        raise ValueError("Sequences file appears empty.")
    values, counts = np.unique(lengths, return_counts=True)
    modal_len = int(values[np.argmax(counts)])
    return modal_len - 1

def generate_seq_sampling(
    model,
    tokenizer,
    seq_length: int,
    seed_text: str,
    n_words: int = 20,
    *,
    temperature: float = 0.9,
    top_k: int | None = 50,
    top_p: float | None = None,   # e.g. 0.9 (nucleus); use either top_k or top_p
    repetition_penalty: float = 1.1,  # >1.0 discourages repeats
    recent_window: int = 20
) -> str:
    """
    Sample next words with temperature + top-k/top-p and a light repetition penalty.

    Args:
        model, tokenizer: your trained pair (tokenizer must match the model).
        seq_length: timesteps expected by the model.
        seed_text: initial text (will be trimmed/padded to seq_length).
        n_words: how many tokens to generate.
        temperature: >1.0 = more random, <1.0 = more conservative. Typical 0.7–1.0.
        top_k: keep only the k highest-prob tokens before sampling (set None to disable).
        top_p: keep smallest set whose cumulative prob ≥ p (nucleus sampling). Use None if using top_k.
        repetition_penalty: >1.0 reduces probability of recently used tokens.
        recent_window: how many recent tokens to penalize.
    """
    assert not (top_k and top_p), "Use either top_k or top_p, not both."
    idx_to_word = getattr(tokenizer, "index_word", {})
    word_to_idx = tokenizer.word_index

    def _sample_id(probs: np.ndarray, recent_ids: list[int]) -> int:
        # temperature scaling (operate in log-space to avoid underflow)
        logits = np.log(probs + 1e-9) / max(temperature, 1e-6)
        probs_t = np.exp(logits)
        probs_t /= probs_t.sum()

        # repetition penalty on recent ids
        if repetition_penalty and recent_ids:
            for tid in set(recent_ids[-recent_window:]):
                probs_t[tid] /= repetition_penalty
            probs_t = np.clip(probs_t, 0, None)
            s = probs_t.sum()
            if s > 0:
                probs_t /= s

        # top-k filter
        if top_k and top_k > 0:
            idxs = np.argpartition(probs_t, -top_k)[-top_k:]
            p = probs_t[idxs]
            p = p / p.sum()
            return int(np.random.choice(idxs, p=p))

        # top-p (nucleus) filter
        if top_p and 0 < top_p < 1:
            sort_idx = np.argsort(-probs_t)
            sort_p = probs_t[sort_idx]
            cumsum = np.cumsum(sort_p)
            cutoff = np.searchsorted(cumsum, top_p, side="right") + 1
            idxs = sort_idx[:cutoff]
            p = probs_t[idxs]
            p = p / p.sum()
            return int(np.random.choice(idxs, p=p))

        # plain multinomial sampling
        return int(np.random.choice(len(probs_t), p=probs_t))

    in_text = seed_text.strip()
    recent_ids: list[int] = []
    for _ in range(n_words):
        enc = tokenizer.texts_to_sequences([in_text])[0]
        enc = pad_sequences([enc], maxlen=seq_length, truncating="pre")
        probs = model.predict(enc, verbose=0)[0]  # softmax over vocab
        next_id = _sample_id(probs, recent_ids)
        next_word = idx_to_word.get(next_id)
        if not next_word:
            break
        in_text += " " + next_word
        recent_ids.append(next_id)
    return in_text

def pick_seed_from_sequences(seqs_path: Path, seq_length: int) -> str:
    """
    Pick a seed from the training sequences file and trim to seq_length tokens.
    """
    with seqs_path.open("r", encoding="utf-8") as f:
        lines = [l.strip() for l in f if l.strip()]
    idx = np.random.randint(len(lines))
    seed_line = lines[idx]
    return " ".join(seed_line.split()[:seq_length])

# ---- Load assets ----
model, tokenizer = load_assets(MODEL_PATH, TOKENIZER_PATH)

# ---- Derive sequence length ----
try:
    seq_length = infer_seq_length(model=model)
except Exception:
    seq_length = infer_seq_length(model=None, sequences_path=SEQS_PATH)

# ---- Sanity: tokenizer vocab should not exceed embedding input_dim (if present) ----
emb_input_dim = next((getattr(l, "input_dim", None) for l in model.layers if hasattr(l, "input_dim")), None)
vocab_size = len(getattr(tokenizer, "word_index", {})) + 1
assert emb_input_dim is None or vocab_size <= emb_input_dim, (
    f"Tokenizer vocab ({vocab_size}) exceeds model embedding input_dim ({emb_input_dim}). "
    "Likely a mismatched tokenizer/model pair."
)

# ---- Pick a seed and generate ----
seed_text = pick_seed_from_sequences(SEQS_PATH, seq_length)
print("SEED:", seed_text, "\n")

generated = generate_seq_sampling(
    model, tokenizer, seq_length, seed_text,
    n_words=30, temperature=0.9, top_k=50, repetition_penalty=1.15
)
print(generated)



SEED: the association of ideas and which does not interfere with the general purposewhat kind or degree of unity is to be sought after in a building in the plastic arts in poetry in prose is a problem which has to be determined relatively to the subject matterto plato himself the 

the association of ideas and which does not interfere with the general purposewhat kind or degree of unity is to be sought after in a building in the plastic arts in poetry in prose is a problem which has to be determined relatively to the subject matterto plato himself the name and in a new kind of punishment the author is the end of the fair character of human last endured probably aims as you at the veil of heaven


In [None]:
"""
Cell — Export artifacts for Django (macOS Intel / TF 2.10 compatible)
---------------------------------------------------------------------
Writes:
  • H5 model     → .../models/nextWordPredict/nextWord.h5
  • Tokenizer    → .../models/tokenizer.pkl
  • Metadata     → .../models/metadata.json
Also does a quiet reload sanity check with compile=False.
"""

from pathlib import Path
import json, pickle, time, platform
from tensorflow.keras.models import load_model as _load_model

# --- Pre-reqs in memory: model, tokenizer, seq_length, vocab_size ---
assert "model" in globals(), "model not found"
assert "tokenizer" in globals(), "tokenizer not found"
assert "seq_length" in globals() and isinstance(seq_length, int)
assert "vocab_size" in globals() and isinstance(vocab_size, int)

# --- Drive project paths (adjust if yours differ) ---
drive_root   = Path("/content/drive/MyDrive")
project_root = drive_root / "Colab Notebooks" / "Predict-Words-Analysis"
models_dir   = project_root / "models"
nw_dir       = models_dir / "nextWordPredict"
nw_dir.mkdir(parents=True, exist_ok=True)
models_dir.mkdir(parents=True, exist_ok=True)

h5_path       = nw_dir / "nextWord.h5"         # Django/TF 2.10 will load this
tokenizer_path= models_dir / "tokenizer.pkl"
metadata_path = models_dir / "metadata.json"

# --- Save artifacts ---
model.save(h5_path)  # HDF5 (Keras 2.x friendly)
with tokenizer_path.open("wb") as f:
    pickle.dump(tokenizer, f)

meta = {
    "created_at_utc": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
    "seq_length": int(seq_length),
    "vocab_size": int(vocab_size),
    "export_format": "h5",
    "python_version": platform.python_version(),
}
with metadata_path.open("w", encoding="utf-8") as f:
    json.dump(meta, f, ensure_ascii=False, indent=2)

# --- Quiet reload sanity (compile=False avoids warnings; not needed in Django) ---
_ = _load_model(h5_path, compile=False)
with tokenizer_path.open("rb") as f:
    _tok = pickle.load(f)
assert len(_tok.word_index) == len(tokenizer.word_index)

print("Exported for Django:")
print("  H5 model  :", h5_path)
print("  Tokenizer :", tokenizer_path)
print("  Metadata  :", metadata_path)
