# Simple GenAI Model using NLP (Step by Step)
# 1. Problem Statement

Objective:
Build a Generative AI model that learns language patterns from a text dataset using NLP techniques and generates new text similar to the input data.

# 2. Dataset

A simple text file: data.txt

# 3. Import Required Libraries

In [5]:

import numpy as np
import re
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# 4. Load Data

In [7]:

with open("data.txt", "r", encoding="utf-8") as file:
    text = file.read().lower()

# 5. NLP Preprocessing
Steps:

Lowercasing (already done when the file was loaded)

Remove punctuation and digits

Normalize text so that only letters and whitespace remain

In [9]:
# This step is pure NLP preprocessing: keep only letters and whitespace
text = re.sub(r"[^a-zA-Z\s]", "", text)
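
To see what this substitution keeps and drops, here is a minimal sketch on a made-up sentence. The helper name `clean_text` and the sample string are illustrative assumptions, not part of the notebook. Note that digits are removed along with punctuation, and the spaces around removed tokens are left in place.

```python
import re

def clean_text(raw: str) -> str:
    """Lowercase the text, then drop everything except letters and whitespace."""
    lowered = raw.lower()
    return re.sub(r"[^a-zA-Z\s]", "", lowered)

# Hypothetical sample sentence, used only for illustration
sample = "Hello, World! NLP is fun in 2024."
cleaned = clean_text(sample)
print(repr(cleaned))  # → 'hello world nlp is fun in '
```

Because newlines count as whitespace, line breaks survive this step, which matters later when the text is split into lines.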

# 6. Tokenization (NLP Step)
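Convert each word into an integer index. Index 0 is reserved for padding, so the vocabulary size is the number of distinct words plus one.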

In [12]:
tokenizer = Tokenizer()
tokenizer.fit_on_texts([text])

word_index = tokenizer.word_index
total_words = len(word_index) + 1
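
The idea behind the word index can be sketched in plain Python. This is only an approximation of the ranking idea (most frequent word gets index 1), not the Keras implementation; `build_word_index` and the sample sentence are assumptions made for illustration.

```python
from collections import Counter

def build_word_index(text: str) -> dict:
    """Map each word to an integer, most frequent word first, starting at 1."""
    counts = Counter(text.split())
    # most_common() sorts by count (descending); ties keep first-seen order
    ranked = [word for word, _ in counts.most_common()]
    return {word: idx for idx, word in enumerate(ranked, start=1)}

# Hypothetical sample text
sample = "the cat sat on the mat"
word_index = build_word_index(sample)
total_words = len(word_index) + 1  # +1 because index 0 is left free for padding

print(word_index)   # → {'the': 1, 'cat': 2, 'sat': 3, 'on': 4, 'mat': 5}
print(total_words)  # → 6
```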

# 7. Create Training Sequences (Language Modeling)

In [14]:

input_sequences = []

for line in text.split("\n"):
    token_list = tokenizer.texts_to_sequences([line])[0]
    for i in range(1, len(token_list)):
        input_sequences.append(token_list[:i+1])
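
The loop above turns each line into a set of growing prefixes, where the last token of every prefix is the word to predict. A small sketch with made-up token IDs shows the pattern; `prefix_sequences` is a hypothetical helper name.

```python
def prefix_sequences(token_list):
    """Return every prefix of length >= 2: some context plus one target token."""
    return [token_list[: i + 1] for i in range(1, len(token_list))]

# Hypothetical token IDs for one line of text
tokens = [4, 2, 7, 1]
for seq in prefix_sequences(tokens):
    print(seq[:-1], "->", seq[-1])
# → [4] -> 2
# → [4, 2] -> 7
# → [4, 2, 7] -> 1
```

A line with a single token produces no training sequence, since there is nothing to predict from.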

# 8. Padding Sequences

Make all sequences the same length by left-padding them with zeros, so the word to predict always sits in the last position.

In [16]:


max_len = max(len(seq) for seq in input_sequences)

input_sequences = pad_sequences(
    input_sequences,
    maxlen=max_len,
    padding="pre"
)
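
What pre-padding does can be sketched without Keras. The function `pad_pre` below is a simplified stand-in written for illustration; it assumes zeros as the fill value and, like the default behaviour of `pad_sequences`, drops tokens from the front when a sequence is too long.

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad each sequence with `value` up to `maxlen` tokens."""
    padded = []
    for seq in sequences:
        trimmed = list(seq)[-maxlen:]  # keep only the last `maxlen` tokens
        padded.append([value] * (maxlen - len(trimmed)) + trimmed)
    return padded

# Hypothetical prefix sequences
sequences = [[4, 2], [4, 2, 7], [4, 2, 7, 1]]
max_len = max(len(seq) for seq in sequences)
for row in pad_pre(sequences, max_len):
    print(row)
# → [0, 0, 4, 2]
# → [0, 4, 2, 7]
# → [4, 2, 7, 1]
```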

# 9. Split Input and Output

In [18]:

X = input_sequences[:, :-1]
y = input_sequences[:, -1]

y = np.eye(total_words)[y]   # One-hot encoding
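
The split and the one-hot step can be traced by hand on a tiny example. This is a plain-Python sketch of the same idea; `split_and_one_hot`, the padded rows, and the vocabulary size of 8 are assumptions chosen for illustration.

```python
def split_and_one_hot(padded, vocab_size):
    """Split rows into inputs (all but last token) and one-hot targets (last token)."""
    X = [row[:-1] for row in padded]
    target_ids = [row[-1] for row in padded]
    y = [[1.0 if j == t else 0.0 for j in range(vocab_size)] for t in target_ids]
    return X, y

# Hypothetical padded sequences
padded = [[0, 0, 4, 2], [0, 4, 2, 7]]
X, y = split_and_one_hot(padded, vocab_size=8)

print(X)     # → [[0, 0, 4], [0, 4, 2]]
print(y[0])  # → [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
```

Each target row has exactly one entry set to 1.0, at the index of the word the model should predict.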

# 10. Build GenAI Model (LSTM)

In [20]:

model = Sequential()
model.add(Embedding(total_words, 50, input_length=max_len-1))
model.add(LSTM(100))
model.add(Dense(total_words, activation="softmax"))

model.compile(
    loss="categorical_crossentropy",
    optimizer="adam",
    metrics=["accuracy"]
)

This is where NLP + GenAI meet:

NLP preprocessing turns raw text into padded integer sequences

The LSTM learns sequence patterns from those sequences

The softmax layer predicts the next word
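
As a rough sanity check on model size, the number of trainable parameters in the three layers can be worked out by hand. The sketch below uses the standard formulas for an embedding layer, an LSTM layer with biases, and a dense layer; the vocabulary size of 50 is a made-up value, and `count_parameters` is a hypothetical helper.

```python
def count_parameters(vocab_size, embed_dim=50, lstm_units=100):
    """Trainable parameters for Embedding -> LSTM -> Dense(softmax)."""
    embedding = vocab_size * embed_dim
    # Four gates, each with input weights, recurrent weights, and a bias
    lstm = 4 * (lstm_units * (embed_dim + lstm_units) + lstm_units)
    dense = lstm_units * vocab_size + vocab_size
    return {"embedding": embedding, "lstm": lstm, "dense": dense,
            "total": embedding + lstm + dense}

# Hypothetical vocabulary size, for illustration only
params = count_parameters(vocab_size=50)
print(params)
# → {'embedding': 2500, 'lstm': 60400, 'dense': 5050, 'total': 67950}
```

The LSTM term does not depend on the vocabulary size, so for a small dataset it is likely to dominate the total.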

# 11. Train the Model

In [22]:

model.fit(X, y, epochs=100, verbose=1)

Epoch 1/100
...
Epoch 100/100


<keras.src.callbacks.History at 0x2087ee63850>

# 12. Text Generation Function (GenAI Output)

In [25]:

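def generate_text(seed_text, next_words):
    """Extend seed_text by next_words words, picking the most likely word each time."""
    for _ in range(next_words):
        # Encode the current text and left-pad it to the model's input length
        token_list = tokenizer.texts_to_sequences([seed_text])[0]
        token_list = pad_sequences(
            [token_list], maxlen=max_len-1, padding="pre"
        )
        # Softmax probabilities over the vocabulary for the next word
        predicted = model.predict(token_list, verbose=0)
        # Greedy decoding: take the highest-probability index and map it back to a word
        predicted_word = tokenizer.index_word[np.argmax(predicted)]
        seed_text += " " + predicted_word
    return seed_text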
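
The function above always takes the single most likely word (`np.argmax`), which can lead to repeated words in the output. One common alternative, not used in this notebook, is temperature sampling: draw the next index at random from the rescaled probabilities. The sketch below is a standalone illustration; `sample_with_temperature` is a hypothetical helper and the probability vector is made up.

```python
import math
import random

def sample_with_temperature(probs, temperature=1.0, rng=None):
    """Draw an index from `probs` after rescaling by `temperature`.

    Low temperatures approach greedy argmax; higher ones add variety.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    rng = rng or random
    # Rescale log-probabilities; the small epsilon avoids log(0)
    logits = [math.log(p + 1e-12) / temperature for p in probs]
    peak = max(logits)  # subtract the peak for numerical stability
    weights = [math.exp(l - peak) for l in logits]
    return rng.choices(range(len(probs)), weights=weights, k=1)[0]

# Hypothetical next-word probabilities over a 3-word vocabulary
probs = [0.1, 0.7, 0.2]
rng = random.Random(0)
print(sample_with_temperature(probs, temperature=0.01, rng=rng))  # → 1
print(sample_with_temperature(probs, temperature=1.0, rng=rng))
```

In the generation loop, the sampled index would take the place of `np.argmax(predicted)`, assuming the first row of `predicted` is passed in as `probs`.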

# 13. Generate New Text

In [27]:

print(generate_text("natural language", 5))

natural language learning learning is networks networks


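# Why Save the Model and Tokenizer in Different Formats?

.pickle / .pkl → commonly used for scikit-learn models and other plain Python objects

Deep learning models (LSTM, GRU) → saved by Keras as .h5 or .keras

So for your GenAI + NLP LSTM model:

Model → .h5 / .keras

Tokenizer → .pkl (pickle)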

# 2. Save GenAI Model
Save the trained LSTM model.

In [30]:
model.save("text_gen_model.h5")



# Save Tokenizer (Important NLP part)

In [32]:

import pickle

with open("tokenizer.pkl", "wb") as f:
    pickle.dump(tokenizer, f)
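
The generation function also depends on `max_len`, which is not stored in the pickled tokenizer. One option, sketched here under assumptions, is to pickle both values together in a single dictionary. The file name `preprocessing.pkl`, the dictionary keys, and the stand-in values are hypothetical; in the notebook the real fitted tokenizer would take the place of the plain dict.

```python
import os
import pickle
import tempfile

# Stand-ins for the real objects: a plain dict instead of the fitted tokenizer
bundle = {"word_index": {"natural": 1, "language": 2}, "max_len": 12}

# Write to a temporary folder so the sketch leaves the working directory untouched
path = os.path.join(tempfile.mkdtemp(), "preprocessing.pkl")
with open(path, "wb") as f:
    pickle.dump(bundle, f)

# Later, for example in a deployment script
with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored["max_len"])  # → 12
```

Pickle files can execute code when loaded, so only load files that come from a trusted source.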

# ------------------------------------

# 3. Loading Model for Deployment

In [35]:

from tensorflow.keras.models import load_model
import pickle

model = load_model("text_gen_model.h5")

with open("tokenizer.pkl", "rb") as f:
    tokenizer = pickle.load(f)

In [36]:
model

<keras.src.engine.sequential.Sequential at 0x20803b91330>

# --------------------------------------------------------

# 4. Simple Deployment Logic (Function Level)

In [39]:

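import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences

# max_len must equal the value used during training; it is not stored in tokenizer.pkl
def generate_text(seed_text, next_words=5):
    for _ in range(next_words):
        # Encode the current text and left-pad it to the model's input length
        token_list = tokenizer.texts_to_sequences([seed_text])[0]
        token_list = pad_sequences(
            [token_list], maxlen=max_len-1, padding="pre"
        )
        predicted = model.predict(token_list, verbose=0)
        # Append the most likely next word
        seed_text += " " + tokenizer.index_word[np.argmax(predicted)]
    return seed_text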

# 6. Example: Streamlit Deployment (Simple)

In [42]:

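# Streamlit widgets do not render inside a notebook: save this cell as a script and launch it with `streamlit run`
import streamlit as st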

st.title("Simple GenAI Text Generator")

input_text = st.text_input("Enter seed text")

if st.button("Generate"):
    output = generate_text(input_text, 10)
    st.write(output)