# Sci-Fi Story Generator using RNN (LSTM)
## Final Project - NLP for Creatives (Peckham DAZ Programme 2024/25)
### 🔍 Project Overview

This project explores how Recurrent Neural Networks (RNNs), specifically Long Short-Term Memory (LSTM) networks, can be used to generate sci-fi stories. A dataset of public domain science fiction texts is used to train a language model that learns how to predict the next word in a sequence and generate creative story snippets.
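To make "predict the next word in a sequence" concrete, here is a minimal sketch of how a text can be turned into training examples. The helper name `next_word_pairs` and the sample sentence are illustrative assumptions, not part of the notebook's code.

```python
def next_word_pairs(words, context_size):
    """Return (context, target) pairs: each run of `context_size` words predicts the word after it."""
    pairs = []
    for i in range(context_size, len(words)):
        pairs.append((words[i - context_size:i], words[i]))
    return pairs


# Hypothetical sample sentence, used only for illustration
sample = "the ship drifted past the silent moon".split()
pairs = next_word_pairs(sample, context_size=3)
for context, target in pairs:
    print(context, "->", target)
```

The notebook uses the same idea with a context of 50 words and integer word ids instead of strings.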


In [None]:
!pip install nltk tensorflow

import numpy as np
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.preprocessing.sequence import pad_sequences
import nltk
from nltk.tokenize import word_tokenize
import re
import requests




In [3]:
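# Load a public-domain book from Project Gutenberg
url = "http://www.gutenberg.org/cache/epub/2147/pg2147.txt"
response = requests.get(url, timeout=30)
response.raise_for_status()  # stop early on an HTTP error instead of tokenizing an error page
text = response.text

# Basic cleaning: lowercase, then keep only letters and whitespace.
# This also strips digits and punctuation, so numbers such as "3000" never reach the vocabulary.
text = text.lower()
text = re.sub(r'[^a-z\s]', '', text)

# word_tokenize needs the Punkt tokenizer data; recent NLTK releases look for 'punkt_tab' instead
nltk.download('punkt')
nltk.download('punkt_tab')
tokens = word_tokenize(text)

print("Total tokens:", len(tokens))
print("First 50 tokens:", tokens[:50])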


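Project Gutenberg plain-text files usually wrap the book in a licence header and footer, which would otherwise end up in the training data. The sketch below shows one way to trim them. The marker wording in the regular expressions is an assumption based on the common `*** START OF ... ***` / `*** END OF ... ***` lines, and `strip_gutenberg_boilerplate` is a hypothetical helper. It would need to run on the raw text before lowercasing and cleaning, because the cleaning step removes the asterisks.

```python
import re

# Assumed marker formats; individual files may differ
START_RE = re.compile(
    r"\*\*\* ?START OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*?\*\*\*",
    re.IGNORECASE | re.DOTALL,
)
END_RE = re.compile(
    r"\*\*\* ?END OF (?:THE|THIS) PROJECT GUTENBERG EBOOK",
    re.IGNORECASE,
)


def strip_gutenberg_boilerplate(raw_text):
    """Return the text between the start and end markers, or the whole text if none are found."""
    start = START_RE.search(raw_text)
    begin = start.end() if start else 0
    end = END_RE.search(raw_text, begin)
    finish = end.start() if end else len(raw_text)
    return raw_text[begin:finish].strip()


# Hypothetical miniature file, used only for illustration
sample_file = (
    "The Project Gutenberg eBook of Example\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK EXAMPLE ***\n"
    "It was a dark and starless night.\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK EXAMPLE ***\n"
    "License text follows.\n"
)
body = strip_gutenberg_boilerplate(sample_file)
print(body)
```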

In [None]:
# Build the vocabulary; index 0 is reserved for padding, hence the +1
tokenizer = Tokenizer()
tokenizer.fit_on_texts(tokens)
total_words = len(tokenizer.word_index) + 1

# Encode the whole token stream once, then slide a window over the integer ids.
# Each window holds seq_length input words plus the one word to predict.
seq_length = 50
encoded = tokenizer.texts_to_sequences([tokens])[0]
input_sequences = [
    encoded[i - seq_length:i + 1]
    for i in range(seq_length, len(encoded))
]

# Split into inputs X (first seq_length ids) and one-hot targets y (last id).
# Every window already has the same length, so pad_sequences is only a safeguard here.
input_sequences = np.array(pad_sequences(input_sequences, maxlen=seq_length+1))
X = input_sequences[:, :-1]
y = to_categorical(input_sequences[:, -1], num_classes=total_words)

print("Vocabulary size:", total_words)
print("Input shape:", X.shape)
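As an alternative to building the windows in a Python loop, the same inputs and targets can be cut from the id array with NumPy. This is a sketch under assumptions: `make_windows` is a hypothetical helper, the toy ids are made up, and it needs NumPy 1.20 or newer for `sliding_window_view`. It returns integer targets rather than one-hot rows. Training on integer targets would need `sparse_categorical_crossentropy` as the loss, which is not what the notebook does.

```python
import numpy as np


def make_windows(token_ids, seq_length):
    """Split a 1-D id sequence into inputs of length seq_length and the id that follows each one."""
    ids = np.asarray(token_ids)
    # Each row is a window of seq_length + 1 consecutive ids (a read-only view, no copy)
    windows = np.lib.stride_tricks.sliding_window_view(ids, seq_length + 1)
    return windows[:, :-1], windows[:, -1]


# Hypothetical toy ids, used only for illustration
toy_ids = [5, 2, 9, 4, 7, 1]
X_toy, y_toy = make_windows(toy_ids, seq_length=3)
print(X_toy)
print(y_toy)
```

With six ids and a window of three inputs plus one target, this yields three training rows, the same count the loop `range(seq_length, len(tokens))` produces.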


In [None]:
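# Two stacked LSTM layers on top of 100-dimensional word embeddings
model = Sequential()
# Declaring the input shape up front lets model.summary() report parameter counts;
# the older Embedding(..., input_length=...) argument is deprecated in Keras 3.
model.add(tf.keras.Input(shape=(seq_length,)))
model.add(Embedding(total_words, 100))
model.add(LSTM(128, return_sequences=True))  # pass the full sequence on to the next LSTM
model.add(LSTM(128))  # keep only the final hidden state
model.add(Dense(total_words, activation='softmax'))  # one probability per vocabulary word

model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()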
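To get a feel for where the model's weights sit, the parameter counts of the three layer types can be worked out by hand. The sketch below uses the standard formulas for an embedding, an LSTM with bias, and a dense layer. The vocabulary size of 10,000 is a hypothetical figure, since the real value depends on the downloaded text.

```python
def embedding_params(vocab_size, embed_dim):
    # One embed_dim-sized vector per vocabulary entry
    return vocab_size * embed_dim


def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights and a bias
    return 4 * (units * (input_dim + units) + units)


def dense_params(input_dim, output_dim):
    # Weight matrix plus one bias per output
    return input_dim * output_dim + output_dim


def total_params(vocab_size, embed_dim=100, units=128):
    return (
        embedding_params(vocab_size, embed_dim)
        + lstm_params(embed_dim, units)   # first LSTM reads the embeddings
        + lstm_params(units, units)       # second LSTM reads the first LSTM's outputs
        + dense_params(units, vocab_size)
    )


hypothetical_vocab = 10_000  # assumption for illustration only
print("Embedding:", embedding_params(hypothetical_vocab, 100))
print("LSTM 1:   ", lstm_params(100, 128))
print("LSTM 2:   ", lstm_params(128, 128))
print("Dense:    ", dense_params(128, hypothetical_vocab))
print("Total:    ", total_params(hypothetical_vocab))
```

Under that assumption the embedding and output layers account for most of the weights, so the vocabulary size matters more for model size than the LSTM width does.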


In [None]:
model.fit(X, y, epochs=20, batch_size=128, verbose=1)


⚠️ **Note:** Training might take a while, especially without a GPU.


In [None]:
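def generate_text(seed_text, next_words=50):
    """Extend seed_text by up to next_words words, always picking the most likely next word."""
    for _ in range(next_words):
        # Encode the text so far, then left-pad or truncate it to the model's input length
        token_list = tokenizer.texts_to_sequences([seed_text])[0]
        token_list = pad_sequences([token_list], maxlen=seq_length, padding='pre')
        predicted = model.predict(token_list, verbose=0)
        predicted_index = int(np.argmax(predicted, axis=-1)[0])
        # Index 0 is the padding slot and has no word, so skip it rather than append a blank
        output_word = tokenizer.index_word.get(predicted_index, '')
        if output_word:
            seed_text += " " + output_word
    return seed_text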
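`generate_text` decodes greedily: it always takes the `argmax` of the predicted distribution. A common alternative is to sample from the distribution after rescaling it with a temperature, which tends to reduce repeated phrases at the cost of more surprising word choices. The sketch below is not part of the notebook. `apply_temperature` and `sample_with_temperature` are hypothetical helpers, and the probabilities are made up for illustration.

```python
import numpy as np


def apply_temperature(probs, temperature=1.0):
    """Rescale a probability vector: temperature < 1 sharpens it, > 1 flattens it."""
    probs = np.asarray(probs, dtype=np.float64)
    logits = np.log(probs + 1e-12) / temperature  # small epsilon avoids log(0)
    logits -= logits.max()                        # subtract the max for numerical stability
    scaled = np.exp(logits)
    return scaled / scaled.sum()


def sample_with_temperature(probs, temperature=1.0, rng=None):
    """Draw one index from the temperature-scaled distribution."""
    if rng is None:
        rng = np.random.default_rng()
    scaled = apply_temperature(probs, temperature)
    return int(rng.choice(len(scaled), p=scaled))


# Hypothetical next-word probabilities, used only for illustration
toy_probs = [0.25, 0.75]
print(apply_temperature(toy_probs, temperature=1.0))
print(apply_temperature(toy_probs, temperature=0.5))
print(sample_with_temperature(toy_probs, temperature=0.8, rng=np.random.default_rng(0)))
```

In `generate_text`, a call such as `sample_with_temperature(predicted[0], temperature=0.8)` could stand in for the `np.argmax` line. The value 0.8 is an arbitrary starting point, not a tuned setting.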


In [None]:
seed = "in the year 3000 a group of astronauts discovered"
story = generate_text(seed, next_words=50)
print("✨ Generated Sci-Fi Text:\n", story)


### 🧠 Reflections and Ethical Notes

- **Learning experience:** I learned how to build and train a language model using RNNs.
- **Challenges:** Memory usage and training time were the main challenges. Using Google Colab helped.
- **Limitations:** The model sometimes repeats phrases or generates grammatically incorrect text.
- **Ethics:** I used public domain data. The model is intended for creative storytelling, not factual information.
- **LLM Disclaimer:** I used ChatGPT to assist in structuring this project, coding, and documentation. All code was tested and understood by me.
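
Part of the memory pressure mentioned under Challenges is likely to come from the one-hot targets, since `to_categorical` stores one value per vocabulary word for every training sequence. The arithmetic below is a rough sketch. The sequence count and vocabulary size are hypothetical, and 4 bytes per value assumes 32-bit numbers.

```python
def one_hot_label_bytes(num_sequences, vocab_size, bytes_per_value=4):
    # One row per sequence, one column per vocabulary word
    return num_sequences * vocab_size * bytes_per_value


def integer_label_bytes(num_sequences, bytes_per_value=4):
    # One integer id per sequence
    return num_sequences * bytes_per_value


# Hypothetical sizes, for illustration only
num_sequences, vocab_size = 100_000, 10_000
print("One-hot labels:", one_hot_label_bytes(num_sequences, vocab_size) / 1e9, "GB")
print("Integer labels:", integer_label_bytes(num_sequences) / 1e6, "MB")
```

Under these assumptions the one-hot targets take about 4 GB, while integer targets take about 0.4 MB.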
