
In [6]:
# Import Libraries
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Embedding
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical


In [7]:
# Task 1 Dataset Preparation:
# Dataset Choice
import nltk
nltk.download('gutenberg')
from nltk.corpus import gutenberg

# Load a sample text (e.g., Jane Austen's Emma)
text = gutenberg.raw('austen-emma.txt')

# Check first 500 characters
print(text[:500])


[Emma by Jane Austen 1816]

VOLUME I

CHAPTER I


Emma Woodhouse, handsome, clever, and rich, with a comfortable home
and happy disposition, seemed to unite some of the best blessings
of existence; and had lived nearly twenty-one years in the world
with very little to distress or vex her.

She was the youngest of the two daughters of a most affectionate,
indulgent father; and had, in consequence of her sister's marriage,
been mistress of his house from a very early period.  Her mother
had died t

[nltk_data] Downloading package gutenberg to /root/nltk_data...
[nltk_data]   Package gutenberg is already up-to-date!


In [8]:
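# 2 Text Preprocessing
import re

# Basic text cleaning: lowercase, then drop everything except letters and whitespace.
# Note: hyphens are dropped too, so "twenty-one" becomes "twentyone".
text = text.lower()
text = re.sub(r'[^a-z\s]', '', text)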


In [9]:
# Dataset Visualization / Stats
words = text.split()
print("Total words:", len(words))
print("Unique words:", len(set(words)))
print("Sample:", words[:50])


Total words: 158128
Unique words: 9306
Sample: ['emma', 'by', 'jane', 'austen', 'volume', 'i', 'chapter', 'i', 'emma', 'woodhouse', 'handsome', 'clever', 'and', 'rich', 'with', 'a', 'comfortable', 'home', 'and', 'happy', 'disposition', 'seemed', 'to', 'unite', 'some', 'of', 'the', 'best', 'blessings', 'of', 'existence', 'and', 'had', 'lived', 'nearly', 'twentyone', 'years', 'in', 'the', 'world', 'with', 'very', 'little', 'to', 'distress', 'or', 'vex', 'her', 'she', 'was']


**Task 2 Exploring GPTs: Model Architecture**

# 1 Transformer Model Overview

GPT (Generative Pre-trained Transformer) is based on the Transformer architecture (Vaswani et al., 2017), which replaces the recurrence of RNNs with self-attention for sequence modeling.

Key features:

- Self-Attention Mechanism: Each token in the input sequence “attends” to the other tokens, so context from the whole sequence is available in a single step instead of being passed along one position at a time.
- Stacked Transformer Blocks: Each block has:
  - Multi-head self-attention
  - Feed-forward neural network
  - Layer normalization and residual connections
- Decoder-only Architecture (GPT): GPT uses only the Transformer decoder stack for causal language modeling, meaning it predicts the next token from the previous tokens only.
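
To make the self-attention and causal-masking ideas concrete, here is a minimal NumPy sketch of a single attention head. It is an illustration under stated assumptions, not GPT's actual implementation: the function name `causal_self_attention`, the random projection matrices, and the tiny dimensions are all hypothetical, and multi-head splitting, layer normalization, and residual connections are left out.

```python
import numpy as np

def causal_self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention with a causal mask.

    x             : (seq_len, d_model) token embeddings
    w_q, w_k, w_v : (d_model, d_head) projection matrices (random here, learned in practice)
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])            # (seq_len, seq_len)

    # Causal mask: position i may only attend to positions <= i.
    future = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)

    # Row-wise softmax (subtract the row max for numerical stability).
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

# Hypothetical toy sizes, chosen only for illustration.
rng = np.random.default_rng(0)
seq_len, d_model, d_head = 5, 8, 4
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))

out, weights = causal_self_attention(x, w_q, w_k, w_v)
print(out.shape)     # (5, 4)
print(weights[0])    # the first token can only attend to itself
```

Each row of `weights` sums to 1, and every entry above the diagonal is 0, which is what "predicting from previous tokens only" means in practice.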

# Text Generation Process

- Tokenization: Converts text into tokens (words, subwords, or characters) using methods like Byte Pair Encoding (BPE).
- Probability Distribution: For the next position, GPT outputs a probability distribution over every token in the vocabulary.
- Sequence Generation:
  1. Start with a seed text (prompt).
  2. Iteratively select the next token (greedy decoding, top-k sampling, or nucleus sampling).
  3. Append it to the sequence and repeat until the desired length is reached.

Key idea: GPT predicts one token at a time, conditioning each prediction on the contextual representations of the previous tokens.
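
The three selection strategies named above can be sketched on a toy next-token distribution. This is a minimal illustration, assuming a five-token vocabulary with made-up probabilities; the helper names `greedy`, `top_k_sample`, and `nucleus_sample` are hypothetical and not part of any library.

```python
import numpy as np

def greedy(probs):
    """Always pick the single most probable token."""
    return int(np.argmax(probs))

def top_k_sample(probs, k, rng):
    """Sample from the k most probable tokens, renormalized."""
    top = np.argsort(probs)[-k:]                  # indices of the k largest probabilities
    p = probs[top] / probs[top].sum()
    return int(rng.choice(top, p=p))

def nucleus_sample(probs, p_threshold, rng):
    """Sample from the smallest set of tokens whose cumulative probability reaches p_threshold."""
    order = np.argsort(probs)[::-1]               # token ids, most probable first
    cumulative = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(cumulative, p_threshold)) + 1
    keep = order[:cutoff]
    p = probs[keep] / probs[keep].sum()
    return int(rng.choice(keep, p=p))

# Made-up distribution over a 5-token vocabulary.
probs = np.array([0.05, 0.50, 0.30, 0.10, 0.05])
rng = np.random.default_rng(0)

print(greedy(probs))                                   # 1
print(top_k_sample(probs, k=2, rng=rng))               # 1 or 2
print(nucleus_sample(probs, p_threshold=0.75, rng=rng))  # 1 or 2
```

Greedy decoding is deterministic, while top-k and nucleus sampling trade some likelihood for variety by drawing from a truncated distribution.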

In [10]:
# 2  Training
# Implementing a Basic Text Generation Model
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Embedding
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Example: corpus
corpus = text.lower().split()  #  split into words

# Tokenizer
tokenizer = Tokenizer()
tokenizer.fit_on_texts(corpus)
total_words = len(tokenizer.word_index) + 1



In [11]:
# Convert corpus tokens into token ids
seq = [tokenizer.word_index[w] for w in corpus if w in tokenizer.word_index]

# Choose a max sequence length
max_seq_len = 40

input_sequences = []

# Sliding window to create sequences of fixed length
for i in range(max_seq_len, len(seq)):
    input_sequences.append(seq[i-max_seq_len:i+1])  # 40 tokens + 1 label token

# Convert to numpy array
input_sequences = np.array(input_sequences)

# Split into inputs and labels
X = input_sequences[:, :-1]   # first 40 tokens of each window (model input)
y = input_sequences[:, -1]    # 41st token of each window (the word to predict)
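
The sliding-window split is easier to see on a short sequence. The sketch below applies the same slicing to a hypothetical list of six token ids with a window of 3 instead of 40; `make_windows` and the toy ids are illustrative names, not part of the notebook.

```python
import numpy as np

def make_windows(token_ids, window):
    """Return (X, y): each row of X holds `window` consecutive ids, y holds the id that follows."""
    rows = [token_ids[i - window:i + 1] for i in range(window, len(token_ids))]
    rows = np.array(rows)
    return rows[:, :-1], rows[:, -1]

toy_ids = [10, 11, 12, 13, 14, 15]      # made-up token ids
X_toy, y_toy = make_windows(toy_ids, window=3)

print(X_toy)
# [[10 11 12]
#  [11 12 13]
#  [12 13 14]]
print(y_toy)
# [13 14 15]
```

A sequence of n ids yields n - window training rows, each shifted one token to the right of the previous one.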


In [6]:
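# One-hot labels (left disabled)
# y is kept as integer token ids. One-hot encoding would build a
# (num_samples, total_words) matrix, which is very large for this corpus.
# Integer labels need loss='sparse_categorical_crossentropy' at compile time.

# from tensorflow.keras.utils import to_categorical
# y = to_categorical(y, num_classes=total_words)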
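
Keeping integer labels loses nothing: cross-entropy computed from integer class ids equals cross-entropy computed from the matching one-hot vectors. The NumPy sketch below checks this on made-up probabilities and then estimates the size of a one-hot label matrix. The figures 158,088 windows and 9,307 classes are assumptions derived from the statistics printed earlier (158,128 words minus the 40-token window, and 9,306 unique words plus the padding index), and 4 bytes per value assumes float32 storage.

```python
import numpy as np

# Made-up predictions for 2 samples over 3 classes, with integer labels.
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.3, 0.6]])
labels = np.array([0, 2])

# Sparse form: pick the predicted probability of the true class directly.
sparse_loss = -np.log(probs[np.arange(len(labels)), labels])

# Dense form: multiply by one-hot vectors and sum over classes.
one_hot = np.eye(probs.shape[1])[labels]
dense_loss = -(one_hot * np.log(probs)).sum(axis=1)

print(np.allclose(sparse_loss, dense_loss))   # True

# Rough memory cost of one-hot labels for this corpus (assumed sizes, float32).
n_samples, vocab_size = 158_088, 9_307
one_hot_gb = n_samples * vocab_size * 4 / 1e9
print(f"{one_hot_gb:.1f} GB")                 # 5.9 GB
```

The integer labels, by contrast, need only one value per sample.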


In [13]:
# Create LSTM model
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, Embedding, LSTM, Dense

model = Sequential()
model.add(Input(shape=(max_seq_len,)))   # X holds max_seq_len (40) token ids per sample
model.add(Embedding(total_words, 64))
model.add(LSTM(100))
model.add(Dense(total_words, activation='softmax'))

# y holds integer token ids (not one-hot vectors), so use the sparse loss
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()




In [3]:
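# Train model
# Note: the recorded run of this cell failed with
# "NameError: name 'model' is not defined" because `model` did not exist in
# that session; run the model-definition cell first.
model.fit(X, y, epochs=50, batch_size=64)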

# Task 3 Application Demonstration: Content Generation with LSTM
1 Description (for report)

The trained LSTM model can generate new text sequences based on a seed phrase. This demonstrates a content creation application, where the model predicts one word at a time to continue a text in the style of Jane Austen.

Steps:

1. The user provides a seed text (e.g., "emma woodhouse was").
2. The model predicts the next word from the probabilities it learned from the training data.
3. The predicted word is appended to the seed text.
4. Steps 2–3 are repeated for a specified number of words (next_words).
5. The output is a newly generated passage that follows the style and vocabulary of the original text.

This can be used for:

- Generating creative writing content.
- Extending stories or chapters automatically.
- Text augmentation for NLP tasks.
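
The five steps can be tried without a trained network by swapping the LSTM for a simple bigram count table. The sketch below is only a stand-in that shows the predict-append-repeat loop; `build_bigram_table`, `generate`, and the toy corpus are hypothetical, and a bigram table looks at just the last word rather than a 40-token context.

```python
from collections import Counter, defaultdict

def build_bigram_table(words):
    """Count, for every word, which words follow it and how often."""
    table = defaultdict(Counter)
    for current, following in zip(words, words[1:]):
        table[current][following] += 1
    return table

def generate(seed_text, next_words, table):
    """Greedy generation: repeatedly append the most frequent successor of the last word."""
    output = seed_text.lower().split()
    for _ in range(next_words):
        candidates = table.get(output[-1])
        if not candidates:              # last word was never seen with a successor
            break
        output.append(candidates.most_common(1)[0][0])
    return " ".join(output)

# Made-up corpus, used only to exercise the loop.
toy_corpus = "emma was happy and emma was happy and emma was rich".split()
table = build_bigram_table(toy_corpus)

print(generate("emma", next_words=3, table=table))   # emma was happy and
```

The LSTM version follows the same loop, with the count lookup replaced by a model prediction over the whole vocabulary.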

In [8]:
# Implementation
from tensorflow.keras.preprocessing.sequence import pad_sequences
import numpy as np

def generate_text(seed_text, next_words, model, tokenizer, max_seq_len):
    """
    Generate text using the trained LSTM model.

    Parameters:
    seed_text (str) : initial text to start generation
    next_words (int) : number of words to generate
    model : trained LSTM model
    tokenizer : fitted Keras tokenizer
    max_seq_len (int) : sequence length used during training

    Returns:
    str : generated text
    """
    output_text = seed_text
    for _ in range(next_words):
        # Convert the running text to token IDs, skipping out-of-vocabulary words
        token_list = [tokenizer.word_index[w] for w in output_text.lower().split() if w in tokenizer.word_index]

        # Left-pad (or keep only the last max_seq_len tokens) to match the training input length
        token_list = pad_sequences([token_list], maxlen=max_seq_len, padding='pre')

        # Predict the next word (greedy: take the most probable token)
        predicted_probs = model.predict(token_list, verbose=0)
        predicted = int(np.argmax(predicted_probs, axis=-1)[0])

        # Look up the word; index 0 is the padding id and maps to no word
        next_word = tokenizer.index_word.get(predicted)
        if next_word is None:
            break

        # Append to text
        output_text += " " + next_word

    return output_text

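# -------------------------------
# Example usage
# Requires the trained `model` from the training cell. The recorded run of this
# cell failed with "NameError: name 'model' is not defined".
# -------------------------------
seed = "emma woodhouse was"
# Pass the same input length used for training (each row of X has max_seq_len tokens)
generated_text = generate_text(seed_text=seed, next_words=20, model=model, tokenizer=tokenizer, max_seq_len=max_seq_len)
print("Generated Text:\n", generated_text)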