# Kenyan Generative Literature with TensorFlow

This notebook trains a generative next-word model on a small Kenyan-themed corpus with TensorFlow and Keras, with the aim of producing coherent, culturally rooted literature. To reduce incoherent output, it uses an expanded corpus, a simplified architecture, and tuned training and generation parameters.

**Goals:**
- Train a text generation model on Kenyan narratives
- Reduce overfitting with a simplified model and early stopping
- Generate coherent, diverse, and culturally relevant literary output


In [8]:
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
import numpy as np
import random
import re

In [9]:
# Expanded Kenyan-themed corpus with diverse, culturally rich sentences
kenyan_corpus = [
    "nairobi streets buzzed with matatus and chatter of street vendors",
    "turkana heat rose in waves off the red earth",
    "tea from kericho steamed beside dusty boots in nyamira",
    "elders gathered under baobab trees stories clinging to their breath",
    "matatus raced the wind down thika road horns blaring with purpose",
    "lake victorias fishermen sang into the dusk oars cutting rhythm",
    "kitengela cows grazed slowly unaware of the citys sprawl nearby",
    "in kibera laughter echoed between tin roofs and hopeful hearts",
    "mount kenya glowed orange at dawn silent and eternal",
    "samburu women danced beads flashing stories into the night",
    "nyama choma smoke curled into stars above eldorets fields",
    "children played under the mango tree chasing chickens and dreams",
    "the rhythm of benga music floated over kisumus waters",
    "a whisper of rain passed over kitui stirring red dust",
    "the market in gikomba never slept just changed hands and smells",
    "boda bodas swarmed like bees through the alleys of kisii",
    "in mombasa spices danced with ocean air in swahili kitchens",
    "the rift valley stretched like time itself beautiful fractured alive",
    "lanterns flickered during blackout dinners in machakos homes",
    "hope brewed with chai and morning prayers in nairobi flats",
    "the maasai warrior stood tall his red shuka vibrant against the savanna",
    "in lamu dhows sailed silently under a crescent moon",
    "the aroma of ugali and sukuma wiki filled kakamega homes",
    "drums echoed through the night in luo villages by lake victoria",
    "traders in wajir bargained fiercely under the scorching sun",
    "the nyika plains whispered tales of ancient migrations",
    "mangoes fell heavy and sweet in the coastal heat of malindi",
    "children in siaya laughed weaving kites from old newspapers",
    "the call to prayer mingled with seagulls in mombasas old town",
    "stars above tsavo burned brighter than any city light",
    "in nakuru flamingos painted lake nakuru pink at dawn",
    "the scent of roasted maize drifted through merus market stalls",
    "elders in kamba lands shared proverbs by the firelight",
    "bicycles creaked along the dusty paths of bungoma",
    "the ocean roared secrets to diani beaches at midnight",
    "tea pickers sang softly in the misty hills of kericho",
    "in garissa camel herds moved like shadows across the desert",
    "the sun set slow behind the ngong hills painting the sky gold",
    "market women in kisii balanced baskets of avocados with grace",
    "the beat of taarab music filled zanzibar street in mombasa",
    "children in turkana crafted toys from sticks and bottle caps",
    "the air in nyeri carried the scent of fresh coffee beans",
    "in marsabit winds sang through volcanic craters at dusk",
    "fishmongers in kisumu shouted prices as boats docked at dawn",
    "the acacia trees stood like sentinels over the maasai mara"
]

# Preprocess the corpus: lowercase and remove punctuation
kenyan_corpus = [re.sub(r'[^\w\s]', '', sentence.lower()) for sentence in kenyan_corpus]
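
To make the effect of the cleaning step concrete, here is a minimal sketch that applies the same regular expression to a single punctuated sentence. The `normalize` helper and the `sample` string are illustrative assumptions, not part of the notebook; the sample is a punctuated variant of one corpus line.

```python
import re

def normalize(sentence):
    # Same pattern as the cell above: lowercase, then drop every character
    # that is neither a word character nor whitespace
    return re.sub(r'[^\w\s]', '', sentence.lower())

# Hypothetical input with capitals, an apostrophe, a comma, and a full stop
sample = "Lake Victoria's fishermen sang into the dusk, oars cutting rhythm."
print(normalize(sample))
# → lake victorias fishermen sang into the dusk oars cutting rhythm
```

Because apostrophes are removed rather than replaced, possessives collapse into single tokens such as `victorias`, which is why those forms appear in the corpus.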

In [10]:
# Tokenize and prepare sequences
tokenizer = Tokenizer()
tokenizer.fit_on_texts(kenyan_corpus)
total_words = len(tokenizer.word_index) + 1

input_sequences = []
for line in kenyan_corpus:
    token_list = tokenizer.texts_to_sequences([line])[0]
    for i in range(2, len(token_list)):
        n_gram_sequence = token_list[:i+1]
        input_sequences.append(n_gram_sequence)

max_sequence_len = max(len(seq) for seq in input_sequences)
# pad_sequences already returns a NumPy array, so no extra np.array() wrapper is needed
input_sequences = pad_sequences(input_sequences, maxlen=max_sequence_len, padding='pre')
X, y = input_sequences[:,:-1], input_sequences[:,-1]
y = tf.keras.utils.to_categorical(y, num_classes=total_words)
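
The prefix-and-pad step above can be traced by hand on a toy example. This is a sketch in plain NumPy under stated assumptions: the token ids are made up (real ids come from the fitted `Tokenizer`), and the left-padding is written out manually to mirror what `padding='pre'` does.

```python
import numpy as np

# Hypothetical token ids for one five-word sentence
token_list = [4, 17, 9, 23, 5]

# Same loop bounds as the cell above: every prefix of length 3 or more
# becomes one training sequence
prefixes = [token_list[:i + 1] for i in range(2, len(token_list))]

# Left-pad with zeros to a common length (the effect of padding='pre')
max_len = max(len(p) for p in prefixes)
padded = np.array([[0] * (max_len - len(p)) + p for p in prefixes])

# All but the last column are the inputs; the last column is the target word
X_toy, y_toy = padded[:, :-1], padded[:, -1]
print(X_toy)
print(y_toy)
```

Here `X_toy` has three rows and `y_toy` is `[9, 23, 5]`. Because the loop starts at 2, the two-word prefix `[4, 17]` is never produced, so the model is not trained to predict a second word from a single first word; starting the range at 1 would add that case.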

In [11]:
# Simplified model to prevent overfitting
model = tf.keras.Sequential([
    tf.keras.Input(shape=(max_sequence_len-1,)),  # Explicit input layer instead of passing input_shape to Embedding
    tf.keras.layers.Embedding(input_dim=total_words, output_dim=32),
    tf.keras.layers.LSTM(32),  # Single LSTM layer with reduced units
    tf.keras.layers.Dropout(0.3),  # Increased dropout
    tf.keras.layers.Dense(32, activation='relu'),
    tf.keras.layers.Dense(total_words, activation='softmax')
])
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()
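
As a rough cross-check on `model.summary()`, the parameter count of this architecture can be worked out by hand. The sketch below assumes the layer sizes shown above; `count_parameters` is a hypothetical helper, and the vocabulary size of 300 is a placeholder, since the notebook does not print `total_words`.

```python
def count_parameters(vocab_size, embed_dim=32, lstm_units=32, dense_units=32):
    # Embedding: one vector of size embed_dim per vocabulary index
    embedding = vocab_size * embed_dim
    # LSTM: 4 gates, each with input weights, recurrent weights, and a bias
    lstm = 4 * (lstm_units * (embed_dim + lstm_units) + lstm_units)
    # Hidden Dense layer: weights plus biases (Dropout adds no parameters)
    dense = lstm_units * dense_units + dense_units
    # Softmax output layer: one weight column and one bias per vocabulary index
    output = dense_units * vocab_size + vocab_size
    return {
        "embedding": embedding,
        "lstm": lstm,
        "dense": dense,
        "output": output,
        "total": embedding + lstm + dense + output,
    }

# 300 is a placeholder vocabulary size, not a value taken from the notebook
print(count_parameters(vocab_size=300))
```

With these sizes the total works out to `65 * vocab_size + 9376`, so most of the capacity sits in the embedding and output layers, which both scale with the vocabulary.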

In [12]:
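# Train with early stopping. monitor='loss' tracks the training loss, so this stops
# on a training plateau; detecting overfitting would need held-out data
# (e.g. validation_split in fit) and monitor='val_loss'
early_stop = tf.keras.callbacks.EarlyStopping(monitor='loss', patience=3)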
model.fit(X, y, epochs=50, batch_size=16, verbose=1, callbacks=[early_stop])

Epoch 1/50
[1m22/22[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m4s[0m 10ms/step - accuracy: 0.0038 - loss: 5.6806
Epoch 2/50
[1m22/22[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 10ms/step - accuracy: 0.0580 - loss: 5.6698
Epoch 3/50
[1m22/22[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 11ms/step - accuracy: 0.0446 - loss: 5.6287
Epoch 4/50
[1m22/22[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 10ms/step - accuracy: 0.0675 - loss: 5.4415
Epoch 5/50
[1m22/22[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 11ms/step - accuracy: 0.0714 - loss: 5.2774
Epoch 6/50
[1m22/22[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 10ms/step - accuracy: 0.0492 - loss: 5.2514
Epoch 7/50
[1m22/22[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 10ms/step - accuracy: 0.0329 - loss: 5.2416
Epoch 8/50
[1m22/22[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 10ms/step - accuracy: 0.0429 - loss: 5.2010
Epoch 9/50
[1m22/22[0m [32m━━━━━━━━━━━━━━━━━━

<keras.src.callbacks.history.History at 0x7e386495bed0>
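
One way to read the loss values in the log is to convert them to perplexity. A model that guesses uniformly over V classes has a categorical cross-entropy of ln(V), so `exp(loss)` can be read as an effective number of equally likely next words. The sketch below uses only the two loss values copied from the log; the interpretation in the comments is an inference, since the notebook does not print the vocabulary size.

```python
import math

first_epoch_loss = 5.6806  # epoch 1 in the log above
last_logged_loss = 5.2010  # epoch 8, the last complete line in the log

# exp(cross-entropy) is the perplexity: the size of a uniform distribution
# that would give the same loss
first_perplexity = math.exp(first_epoch_loss)  # roughly 293
last_perplexity = math.exp(last_logged_loss)   # roughly 181

print(f"epoch 1 perplexity: {first_perplexity:.1f}")
print(f"epoch 8 perplexity: {last_perplexity:.1f}")
```

If the vocabulary holds on the order of 290 words, the first-epoch loss corresponds to near-uniform guessing, and after eight epochs the model is still spreading its probability over well over a hundred candidates per step. That is consistent with the single-digit accuracy in the log.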

In [13]:
# Text generation with temperature ("diversity") sampling
def generate_kenyan_text(seed_text, next_words=25, diversity=0.5):
    result = seed_text.lower()  # Normalize seed text
    for _ in range(next_words):
        token_list = tokenizer.texts_to_sequences([result])[0]
        token_list = pad_sequences([token_list], maxlen=max_sequence_len-1, padding='pre')
        # Cast to float64 so the renormalized probabilities sum to 1 within
        # np.random.choice's tolerance
        predictions = model.predict(token_list, verbose=0)[0].astype('float64')
        # Rescale log-probabilities by the diversity (temperature)
        predictions = np.log(predictions + 1e-8) / diversity
        exp_preds = np.exp(predictions)
        exp_preds[0] = 0.0  # Index 0 is the padding slot, never a real word
        predictions = exp_preds / np.sum(exp_preds)
        predicted = np.random.choice(total_words, p=predictions)
        # index_word is the Tokenizer's built-in index-to-word mapping
        output_word = tokenizer.index_word.get(int(predicted), '')
        result += " " + output_word
    return result.capitalize()
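
The effect of the `diversity` parameter is easier to see on a small distribution than on the full vocabulary. The following sketch repeats the same log, divide, exponentiate, renormalize arithmetic used in `generate_kenyan_text`; the `apply_temperature` helper and the three-word `toy_probs` distribution are illustrative assumptions.

```python
import numpy as np

def apply_temperature(probs, diversity):
    # Scale log-probabilities by 1/diversity, then renormalize to sum to 1
    logits = np.log(np.asarray(probs, dtype='float64') + 1e-8) / diversity
    exp_logits = np.exp(logits)
    return exp_logits / exp_logits.sum()

# Hypothetical next-word distribution over three candidate words
toy_probs = [0.5, 0.3, 0.2]

for diversity in (0.2, 0.5, 1.0, 2.0):
    print(diversity, np.round(apply_temperature(toy_probs, diversity), 3))
```

Values below 1.0 sharpen the distribution toward the most likely word, 1.0 leaves it essentially unchanged, and values above 1.0 flatten it. When the model's own predictions are already close to uniform, a low diversity has little to sharpen, so sampled text will still look close to random.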

In [14]:
# Generate text with a culturally relevant seed
seed = "the market in gikomba"
print("Generated Kenyan Literature:\n")
print(generate_kenyan_text(seed, next_words=20, diversity=0.5))

Generated Kenyan Literature:

The market in gikomba volcanic filled at and mombasas rhythm gold city gold changed and waters rhythm beads in savanna cutting gold victoria cutting
