In [None]:
"""
# Project Overview and Explanation

This project demonstrates how I build and train a simple character-level text generation model on a custom dataset. I curated the dataset myself and uploaded it to the Hugging Face Hub. Because I already know its structure, I skip exploratory visualization and go straight to the extraction and cleaning steps the model needs.

### Dataset Description
The dataset is titled `sid0608/AI-Story-Telling-Platform` and contains story content in multiple genres. For this demonstration, I focus on the 'Romance' genre. Because I am familiar with the dataset, I could extract and clean exactly the text I needed to train the text generation model. I sorted the Romance stories by word count, picked one story, and kept a 4,000-word excerpt from it (words 5,000 to 9,000) so that the input stays manageable for the model.

### Model Description
The model itself is a simple Recurrent Neural Network (RNN) with a GRU (Gated Recurrent Unit) layer, which is suitable for handling sequential data like text. The architecture includes:
1. **Embedding Layer:** This converts the input characters into dense vectors of fixed size, making it easier for the model to learn.
2. **GRU Layer:** This layer processes the sequence of vectors, learning dependencies and patterns within the text data.
3. **Dense Layer:** This final layer maps the GRU outputs to the vocabulary size, predicting the next character in the sequence.

### Training and Inference
The model is trained on the curated dataset for one epoch due to time constraints. It is saved using checkpoints to allow for reloading and continued training or inference without starting from scratch. During inference, I load the model with a batch size of 1 to generate text based on a given starting string.

### Limitations and Future Improvements
This is not my finest work, and I am aware of the limitations and areas where the model can be improved:
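- **Overfitting:** The model might be overfitting to the training data because the training text is small (a 4,000-word excerpt) relative to the model's capacity, and the heavily overlapping training windows show it the same characters many times even within a single epoch.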
- **Hyperparameter Tuning:** I have not extensively tuned the model's hyperparameters such as learning rate, dropout rate, and GRU units, which can be optimized for better performance.
- **Advanced Architectures:** Incorporating more advanced architectures like LSTMs, transformers, or using transfer learning with pre-trained models could significantly improve the text generation quality.

### Skill Demonstration
Despite these limitations, this project showcases my familiarity with essential concepts in deep learning, text processing, and model training. I am also well-versed in Python's object-oriented programming (OOP) concepts and can refactor this code into a modular, OOP-based structure with proper documentation if required.

This project is a demonstration of my skills and understanding of the topics within the time constraints, and I am confident in my ability to expand and refine this model further.
"""


In [1]:
!pip install tensorflow==2.12.0
# Restart the session after installing tensorflow==2.12.0


In [1]:
!pip install datasets


In [2]:
import os
import re

import numpy as np
import pandas as pd
import tensorflow as tf
from datasets import load_dataset
from huggingface_hub import HfFolder


In [22]:
# I am using a dataset that I scraped earlier, which contains a large collection of stories of varying lengths.
# For this task I focus on a single story of approximately 1.5 million words and train on a 4,000-word excerpt from it (words 5,000 to 9,000).
# The row index and slice below are hard-coded because I am familiar with the data's structure and content.
# The access token is read from the HF_TOKEN environment variable (an assumed name) instead of being hard-coded in the notebook.
hf_token = os.environ["HF_TOKEN"]
HfFolder.save_token(hf_token)
dataset = load_dataset("sid0608/AI-Story-Telling-Platform", token=hf_token)
df = pd.DataFrame(dataset['train'])
# .copy() makes the filtered frame independent, so adding a column does not raise SettingWithCopyWarning.
df = df[df['Genre'] == 'Romance'].copy()
df['assistant_count'] = df['Assistant'].str.split().str.len()
df = df.sort_values(by='assistant_count', ascending=True)
data = df.iloc[791]['Assistant']
data_x = data.split()[5000:9000]
data = ' '.join(data_x)


In [24]:
def clean_text(text):
    # Lowercase and keep only letters, digits, whitespace, and basic punctuation.
    text = text.lower()
    text = re.sub(r'[^a-z0-9\s.,!?]', '', text)
    return text

data = clean_text(data)

# Character-level vocabulary and lookup tables.
vocab = sorted(set(data))
char2idx = {char: idx for idx, char in enumerate(vocab)}
idx2char = np.array(vocab)

# Slide a window of seq_length + 1 characters over the text with stride 1.
seq_length = 100
tokenized_text = [char2idx[c] for c in data]
sequences = [
    tokenized_text[i:i + seq_length + 1]
    for i in range(0, len(tokenized_text) - seq_length)
]
dataset = tf.data.Dataset.from_tensor_slices(sequences)

def split_input_target(sequence):
    # The target is the input shifted by one character.
    input_text = sequence[:-1]
    target_text = sequence[1:]
    return input_text, target_text

# A stateful GRU needs a fixed batch size; 1 matches the batch size used at inference.
BATCH_SIZE = 1
dataset = dataset.map(split_input_target).batch(BATCH_SIZE, drop_remainder=True)
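
The windowing step above can be checked on a tiny input without TensorFlow. The following is a minimal sketch in plain Python, assuming a hypothetical sample string and a window length of 4 instead of 100; `make_windows` and the `toy_*` names are illustrative helpers, not part of the notebook.

```python
import re

def clean_text(text):
    # Same rule as the notebook: lowercase, then keep letters, digits,
    # whitespace, and the punctuation . , ! ?
    return re.sub(r'[^a-z0-9\s.,!?]', '', text.lower())

def make_windows(token_ids, seq_length):
    # Hypothetical helper: one window of seq_length + 1 ids per start
    # position, moving one character at a time (stride 1).
    return [token_ids[i:i + seq_length + 1]
            for i in range(len(token_ids) - seq_length)]

def split_input_target(window):
    # The target is the input shifted one character to the left.
    return window[:-1], window[1:]

toy_text = clean_text("Hello, World!")  # hypothetical sample text
toy_vocab = sorted(set(toy_text))
toy_char2idx = {ch: i for i, ch in enumerate(toy_vocab)}
toy_ids = [toy_char2idx[ch] for ch in toy_text]

toy_windows = make_windows(toy_ids, seq_length=4)
first_input, first_target = split_input_target(toy_windows[0])

print(len(toy_text), len(toy_windows))
print(first_input, first_target)
```

A text of N characters yields N - seq_length windows, and consecutive windows differ by a single character, so with stride 1 the model sees almost every character about seq_length times per epoch.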

In [25]:
vocab_size = len(vocab)
embedding_dim = 128
rnn_units = 256
def build_model(vocab_size, embedding_dim, rnn_units, batch_size):
    model = tf.keras.Sequential([
        # Maps each character id to a dense vector.
        tf.keras.layers.Embedding(input_dim=vocab_size,
                                  output_dim=embedding_dim,
                                  batch_input_shape=[batch_size, None]),
        # stateful=True carries the hidden state across batches, so the batch size is fixed.
        tf.keras.layers.GRU(rnn_units,
                            return_sequences=True,
                            stateful=True,
                            recurrent_initializer='glorot_uniform',
                            kernel_regularizer=tf.keras.regularizers.l2(0.01),
                            dropout=0.3),
        # Outputs one logit per vocabulary character at every time step.
        tf.keras.layers.Dense(vocab_size)
    ])
    return model

def loss(labels, logits):
    return tf.keras.losses.sparse_categorical_crossentropy(labels, logits, from_logits=True)

optimizer = tf.keras.optimizers.Adam()
model = build_model(vocab_size, embedding_dim, rnn_units, BATCH_SIZE)
model.compile(optimizer=optimizer, loss=loss, metrics=['accuracy'])
model.summary()
checkpoint_dir = './training_checkpoints'
os.makedirs(checkpoint_dir, exist_ok=True)
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt_{epoch}.weights.h5")
checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(filepath=checkpoint_prefix, save_weights_only=True)

Model: "sequential_2"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_2 (Embedding)     (1, None, 128)            3968      
                                                                 
 gru_2 (GRU)                 (1, None, 256)            296448    
                                                                 
 dense_2 (Dense)             (1, None, 31)             7967      
                                                                 
Total params: 308,383
Trainable params: 308,383
Non-trainable params: 0
_________________________________________________________________
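
The parameter counts in the summary can be reproduced by hand. The sketch below assumes the Keras GRU default `reset_after=True`, which gives each of the three gates two bias vectors; the helper function names are illustrative and not part of the notebook.

```python
def embedding_params(vocab_size, embedding_dim):
    # One embedding vector per vocabulary entry.
    return vocab_size * embedding_dim

def gru_params(input_dim, units):
    # Three gates, each with an input kernel, a recurrent kernel,
    # and two bias vectors (assuming reset_after=True).
    return 3 * (input_dim * units + units * units + 2 * units)

def dense_params(input_dim, output_dim):
    # Weight matrix plus one bias per output unit.
    return input_dim * output_dim + output_dim

vocab_size, embedding_dim, rnn_units = 31, 128, 256

total = (embedding_params(vocab_size, embedding_dim)
         + gru_params(embedding_dim, rnn_units)
         + dense_params(rnn_units, vocab_size))
print(total)  # 308383, matching "Total params" in the summary
```

The GRU layer holds roughly 96% of the weights, so `rnn_units` is the setting that most affects model capacity.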


In [26]:
EPOCHS = 1
trained_model = model.fit(dataset, epochs=EPOCHS, callbacks=[checkpoint_callback])



In [27]:
# Rebuild the network with batch size 1 and restore the weights saved after epoch 1.
inference_model = build_model(vocab_size, embedding_dim, rnn_units, batch_size=1)
inference_model.load_weights(os.path.join(checkpoint_dir, 'ckpt_1.weights.h5'))
inference_model.build(tf.TensorShape([1, None]))

In [29]:
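def generate_text(model, start_string, num_generate=100, temperature=1.0):
    # Encode the prompt; every character must already be in the training vocabulary.
    input_eval = [char2idx[s] for s in start_string]
    input_eval = tf.expand_dims(input_eval, 0)
    text_generated = []
    model.reset_states()
    for _ in range(num_generate):
        predictions = model(input_eval)
        predictions = tf.squeeze(predictions, 0)
        # Lower temperatures sharpen the distribution; higher ones flatten it.
        predictions = predictions / temperature
        # Sample the next character from the logits at the last time step.
        predicted_id = tf.random.categorical(predictions, num_samples=1)[-1, 0].numpy()
        # Feed only the new character back in; the stateful GRU keeps the context.
        input_eval = tf.expand_dims([predicted_id], 0)
        text_generated.append(idx2char[predicted_id])
    return (start_string + ''.join(text_generated))

# Generate from the reloaded inference model rather than the training model.
prompt = "hi how are you"
start_text = prompt.lower()
print(generate_text(inference_model, start_text, num_generate=500, temperature=0.5))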

Temperature 0.5:
girl is cutelang anound the world forever if i were in your place i have always dreamed of seeing the whole world. christiana sighs quietly and glances and i were in your place i have always dreamed of seeing the whole world. christiana sighs quietly and glances if i were in your place i have always dreamed of seeing the whole world. christiana sighs quietly and glances qiietly and glances if i were in your place i have always dreamed of seeing the whole world. christiana sighs quietly and glance i world fo
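
The effect of the `temperature` argument can be seen on a small example. This is a minimal sketch in plain Python, assuming three made-up logits; `softmax_with_temperature` is an illustrative helper that computes the probabilities `tf.random.categorical` would sample from after the logits are divided by the temperature.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    # Divide the logits by the temperature, then apply a numerically
    # stable softmax (subtract the maximum before exponentiating).
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

toy_logits = [2.0, 1.0, 0.0]  # hypothetical logits for three characters

probs_t1 = softmax_with_temperature(toy_logits, temperature=1.0)
probs_t05 = softmax_with_temperature(toy_logits, temperature=0.5)

print([round(p, 3) for p in probs_t1])
print([round(p, 3) for p in probs_t05])
```

At temperature 0.5 the most likely character takes a larger share of the probability mass than at 1.0. Together with memorization of the small training text, this may contribute to the repeated phrases in the sample above.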
