GUID: 2660237
GitHub Link: [GitHub](https://github.com/Natasha-Warder/AI.Python.github.io.-)

# Generating Text with Neural Networks


# Getting the Data

The code below downloads the text file from the URL and reads its contents into the variable shakespeare_text. The text can then be used for a variety of natural language processing or text generation tasks.

In [None]:
import tensorflow as tf

shakespeare_url = "https://homl.info/shakespeare"  # shortcut URL
filepath = tf.keras.utils.get_file("shakespeare.txt", shakespeare_url)
with open(filepath) as f:
    shakespeare_text = f.read() 

In [None]:
print(shakespeare_text[:80])  # print the first 80 characters to see how the text is structured and what it says

Inspecting a portion of the data is a good way to understand its structure and format. Before any preprocessing, simply reading a sample shows what the text looks like.
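A few summary statistics can complement printing a slice. The sketch below is a minimal illustration on a short inline sample; `summarize_text` is a hypothetical helper that is not part of the notebook, and the same call could be applied to `shakespeare_text`.

```python
def summarize_text(text):
    """Return basic statistics about a text (hypothetical helper)."""
    return {
        "n_chars": len(text),                        # total number of characters
        "n_lines": len(text.splitlines()),           # number of lines
        "n_distinct_chars": len(set(text.lower())),  # distinct characters, case-insensitive
    }

# Short inline sample used only for illustration.
sample = "First Citizen:\nBefore we proceed any further, hear me speak.\n"
stats = summarize_text(sample)
print(stats)
```

The case-insensitive count of distinct characters gives a rough preview of the vocabulary size a lowercasing character-level tokenizer would produce.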

# Preparing the Data

In [None]:
text_vec_layer = tf.keras.layers.TextVectorization(split="character", # specifies that the text should be tokenized into individual characters 
                                                   standardize="lower") # specifies that the tokenized text should be in lowercase
text_vec_layer.adapt([shakespeare_text])
encoded = text_vec_layer([shakespeare_text])[0]

TensorFlow's TextVectorization layer converts the text into numerical form using character-level tokenization: each character is lowercased and mapped to an integer ID.
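The idea behind character-level tokenization can be sketched in plain Python. This is only an approximation of what the layer does, not its implementation: `build_char_vocab` and `encode` are hypothetical helpers, and the ordering of equally frequent characters may differ from the layer's. The sketch assumes index 0 is reserved for padding and index 1 for unknown characters, with the remaining characters ordered by frequency.

```python
from collections import Counter

def build_char_vocab(text):
    """Build a character vocabulary: padding, unknown, then characters by frequency."""
    counts = Counter(text.lower())
    chars = [ch for ch, _ in counts.most_common()]
    return ["", "[UNK]"] + chars

def encode(text, vocab):
    """Map each lowercased character to its ID; unseen characters map to 1."""
    index = {ch: i for i, ch in enumerate(vocab)}
    return [index.get(ch, 1) for ch in text.lower()]

sample = "To be or not to be"
vocab = build_char_vocab(sample)
ids = encode(sample, vocab)
print(vocab)
print(ids)
```

Decoding the IDs back through `vocab` recovers the lowercased text, which is a quick way to check that the mapping is consistent.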

In [None]:
print(text_vec_layer([shakespeare_text]))

In [None]:
encoded -= 2  # drop tokens 0 (pad) and 1 (unknown), which we will not use
n_tokens = text_vec_layer.vocabulary_size() - 2  # number of distinct chars = 39
dataset_size = len(encoded)  # total number of chars = 1,115,394

In [None]:
print(n_tokens, dataset_size)

The text has now been tokenized by the TextVectorization layer and the vocabulary size has been computed. The to_dataset function below turns a sequence of token IDs into a dataset of input/target windows; it is used later to build the training, validation, and test sets.

In [None]:
def to_dataset(sequence, length, shuffle=False, seed=None, batch_size=32):
    ds = tf.data.Dataset.from_tensor_slices(sequence)  # creates a TensorFlow dataset with one element per token ID
    # sliding windows of length + 1 tokens; drop_remainder=True discards partial windows
    ds = ds.window(length + 1, shift=1, drop_remainder=True)
    # each window is a nested dataset, so batch it into a single tensor
    ds = ds.flat_map(lambda window_ds: window_ds.batch(length + 1))
    if shuffle:
        ds = ds.shuffle(100_000, seed=seed)
    ds = ds.batch(batch_size)
    # inputs are all tokens but the last; targets are the same tokens shifted by one
    return ds.map(lambda window: (window[:, :-1], window[:, 1:])).prefetch(1)

If the shuffle parameter is True, the dataset is shuffled. The buffer size for shuffling is set to 100,000, and a seed is provided for reproducibility.
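The windowing step can be illustrated without TensorFlow. The sketch below is a simplified plain-Python stand-in for the windowing and input/target split (no shuffling or batching); `make_windows` is a hypothetical helper introduced only for this illustration.

```python
def make_windows(sequence, length):
    """Return (input, target) pairs from sliding windows of length + 1 items.

    Windows shift by one position, and incomplete windows at the end are
    dropped, similar to window(..., shift=1, drop_remainder=True).
    """
    pairs = []
    for start in range(len(sequence) - length):
        window = sequence[start:start + length + 1]
        pairs.append((window[:-1], window[1:]))  # target is the input shifted by one
    return pairs

for inputs, targets in make_windows("to be", length=3):
    print(repr(inputs), "->", repr(targets))
```

A sequence of N items yields N - length windows, so the five-character string above produces two pairs.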

The cell below calls to_dataset on three slices of the encoded data to build the training, validation, and test sets:

In [None]:
length = 100
tf.random.set_seed(42) #if you run the code again with the same seed, you should get the same random results
train_set = to_dataset(encoded[:100_000], length=length, shuffle=True, #creates the training dataset
                       seed=42)
valid_set = to_dataset(encoded[100_000:160_000], length=length) #creates the validation dataset
test_set = to_dataset(encoded[160_000:], length=length) #creates the test dataset

length = 100: the number of characters in each input sequence. Each window holds length + 1 characters, so the input and its target (the same sequence shifted by one character) are both 100 characters long.


train_set =...

It takes the first 100,000 elements of the encoded data, uses the to_dataset function to convert them into a TensorFlow dataset, and shuffles the windows.

valid_set =...

It takes the elements from index 100,000 up to (but not including) 160,000 of the encoded data and converts them to a TensorFlow dataset without shuffling.

test_set =...

It takes the elements from index 160,000 to the end of the encoded data and converts them to a TensorFlow dataset without shuffling.
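To get a feel for how large each split is, the number of windows and batches can be worked out with simple arithmetic. The sketch below assumes the dataset size of 1,115,394 characters reported earlier, the slice boundaries used above, and the default batch size of 32; `count_windows` is a hypothetical helper.

```python
import math

dataset_size = 1_115_394  # total number of characters, as reported in the notebook
length = 100              # input sequence length
batch_size = 32           # default batch size in to_dataset

splits = {
    "train": (0, 100_000),
    "valid": (100_000, 160_000),
    "test": (160_000, dataset_size),
}

def count_windows(n_chars, length):
    """Each window needs length + 1 characters and windows shift by one character."""
    return max(n_chars - length, 0)

summary = {}
for name, (start, end) in splits.items():
    n_windows = count_windows(end - start, length)
    n_batches = math.ceil(n_windows / batch_size)  # the last batch may be partial
    summary[name] = (n_windows, n_batches)
    print(f"{name}: {n_windows:,} windows, {n_batches:,} batches")
```

With these boundaries, the test split is by far the largest of the three, which is worth keeping in mind when estimating evaluation time.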



# Building and Training the Model
Building and training a model typically involves defining the model architecture, compiling the model, and then fitting it to the training data.

In [None]:
tf.random.set_seed(42)  # extra code – ensures reproducibility on CPU
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=n_tokens, output_dim=16), # mapping the input tokens to vectors of size 16
    tf.keras.layers.GRU(128, return_sequences=True),  # GRU: a Gated Recurrent Unit layer that returns an output at every time step
    tf.keras.layers.Dense(n_tokens, activation="softmax") # output layer with a number of units equal to the number of tokens in the vocabulary
])
model.compile(loss="sparse_categorical_crossentropy", optimizer="nadam",
              metrics=["accuracy"])
model_ckpt = tf.keras.callbacks.ModelCheckpoint(  # saves the model with the best validation accuracy during training
    "my_shakespeare_model", monitor="val_accuracy", save_best_only=True)
history = model.fit(train_set, validation_data=valid_set, epochs=10,
                    callbacks=[model_ckpt])

The training process is monitored, and the best model based on validation accuracy is saved. After training, the history object contains information about the training and validation metrics over epochs.

In [None]:
shakespeare_model = tf.keras.Sequential([
    text_vec_layer,
    tf.keras.layers.Lambda(lambda X: X - 2),  # no <PAD> or <UNK> tokens
    model
])

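This code wraps the previously trained model together with the text vectorization layer (text_vec_layer), so the combined model accepts raw strings as input. A Lambda layer subtracts 2 from the token IDs so that they match the shifted IDs used during training, where the padding and unknown tokens were dropped.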
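The effect of the ID shift can be sketched with a miniature vocabulary. Everything in this example is hypothetical (the vocabulary and the helpers `to_model_id` and `to_char`); it only illustrates why 2 is subtracted on the way into the model and added back when looking up a predicted character.

```python
# Hypothetical miniature vocabulary: index 0 is padding, index 1 is the
# unknown token, and real characters start at index 2.
vocab = ["", "[UNK]", " ", "e", "t", "o"]

def to_model_id(layer_id):
    """Shift a vectorization-layer ID so the first real character becomes 0."""
    return layer_id - 2

def to_char(model_id):
    """Map a model output ID back to its character by undoing the shift."""
    return vocab[model_id + 2]

text = "to tee"
layer_ids = [vocab.index(ch) for ch in text]
model_ids = [to_model_id(i) for i in layer_ids]
decoded = "".join(to_char(i) for i in model_ids)

print(layer_ids)
print(model_ids)
print(decoded)
```

Because the model's output layer has one unit per real character, its IDs start at 0, and the + 2 offset is needed whenever a predicted ID is looked up in the layer's vocabulary.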

# Generating Text

The basic idea of text generation is to feed a seed sequence of text into the trained model, sample the next character from the model's predicted probabilities, append it to the text, and repeat. The level of randomness in the sampling is adjusted through the temperature.
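The generation loop itself is independent of the model. The sketch below uses a hypothetical toy "model" (a hand-written table of next-character probabilities that depends only on the last character) in place of the trained network, purely to show the predict, sample, and append cycle; `TOY_MODEL`, `next_char_toy`, and `extend_text_toy` are illustrative names.

```python
import random

# Hypothetical stand-in for the trained model: next-character probabilities
# that depend only on the last character of the text so far.
TOY_MODEL = {
    "t": {"o": 0.7, " ": 0.3},
    "o": {" ": 0.6, "t": 0.4},
    " ": {"t": 0.5, "o": 0.5},
}

def next_char_toy(text, rng):
    """Sample one character from the toy distribution for the last character."""
    probs = TOY_MODEL[text[-1]]
    chars = list(probs)
    return rng.choices(chars, weights=[probs[c] for c in chars], k=1)[0]

def extend_text_toy(text, n_chars, rng):
    """Repeatedly sample the next character and append it to the text."""
    for _ in range(n_chars):
        text += next_char_toy(text, rng)
    return text

rng = random.Random(42)  # seeded generator so repeated runs give the same text
generated = extend_text_toy("to", n_chars=20, rng=rng)
print(generated)
```

The real model conditions on the whole input sequence rather than only the last character, but the surrounding loop has the same shape.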

In [None]:
y_proba = shakespeare_model.predict(["To be or not to be"])[0, -1]
y_pred = tf.argmax(y_proba)  # choose the most probable character ID
text_vec_layer.get_vocabulary()[y_pred + 2]  # converts the character ID back to the corresponding character using the layer's vocabulary

In [None]:
log_probas = tf.math.log([[0.5, 0.4, 0.1]])  # probas = 50%, 40%, and 10%
tf.random.set_seed(42)
tf.random.categorical(log_probas, num_samples=8)  # draw 8 samples
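The cell above draws class indices at random, given log-probabilities. A rough plain-Python equivalent is sketched below; `sample_categorical` is a hypothetical helper, and it will not reproduce TensorFlow's random stream, only the same sampling distribution.

```python
import math
import random

def sample_categorical(log_probas, num_samples, rng):
    """Draw class indices with probability proportional to exp(log-probability)."""
    weights = [math.exp(lp) for lp in log_probas]  # random.choices normalizes the weights
    return rng.choices(range(len(weights)), weights=weights, k=num_samples)

log_probas = [math.log(p) for p in (0.5, 0.4, 0.1)]  # probas = 50%, 40%, and 10%
rng = random.Random(42)
samples = sample_categorical(log_probas, num_samples=10_000, rng=rng)

# With many samples, the observed frequencies should approach the probabilities.
freqs = [samples.count(i) / len(samples) for i in range(3)]
print(freqs)
```

Drawing only 8 samples, as in the cell above, gives a noisy picture; larger sample counts make the 50/40/10 split visible.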

In [None]:
def next_char(text, temperature=1):
    y_proba = shakespeare_model.predict([text])[0, -1:]  # predicted probabilities for the next character
    rescaled_logits = tf.math.log(y_proba) / temperature  # divide the log-probabilities by the temperature
    char_id = tf.random.categorical(rescaled_logits, num_samples=1)[0, 0]
    return text_vec_layer.get_vocabulary()[char_id + 2]

In [None]:
def extend_text(text, n_chars=50, temperature=1):  # appends n_chars sampled characters to the text
    for _ in range(n_chars):
        text += next_char(text, temperature)
    return text

In [None]:
tf.random.set_seed(42)  # extra code – ensures reproducibility on CPU

# Adjusting Temperature & Output

In [None]:
print(extend_text("To be or not to be", temperature=0.01)) 

## Findings
The seed text is "To be or not to be", which serves as the starting point for text generation. The extend_text function generates additional text from the seed by repeatedly predicting the next character and appending it to the generated text. The temperature parameter is set to a low value of 0.01. In the context of text generation, temperature controls the randomness of the predictions:
- Lower temperature (close to 0) leads to more deterministic predictions.
- Higher temperature introduces more randomness into the predictions.
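The effect of temperature on a distribution can be checked numerically. The sketch below divides the log-probabilities by the temperature and renormalizes, which corresponds to the rescaling done in next_char before sampling; `apply_temperature` is a hypothetical helper, and it assumes all probabilities are strictly positive.

```python
import math

def apply_temperature(probas, temperature):
    """Rescale a probability distribution by dividing its log-probabilities
    by the temperature, then renormalize so the result sums to 1."""
    scaled = [math.log(p) / temperature for p in probas]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

probas = [0.5, 0.4, 0.1]
for temperature in (0.01, 1, 100):
    rescaled = apply_temperature(probas, temperature)
    print(temperature, [round(p, 3) for p in rescaled])
```

At a temperature of 0.01 nearly all of the probability mass moves to the most likely class, at 1 the distribution is unchanged, and at 100 the three classes become almost equally likely.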

### Output Analysis:
With a temperature this low, the most probable character is chosen almost every time, so the generated text is close to deterministic. The predictions tend to follow a specific pattern, resulting in more structured and less varied output.

In [None]:
print(extend_text("To be or not to be", temperature=1)) # Moderate temperature (around 1) balances randomness and predictability.

## Findings
The seed text is still "To be or not to be", but the temperature parameter is now set to 1, a higher value than in the previous run. The function uses the trained model to predict the next characters in the sequence based on the seed text, and a higher temperature leads to more diverse and random predictions.

### Output Analysis:
With a temperature of 1, the log-probabilities are divided by 1, so characters are sampled directly from the model's predicted distribution rather than from a sharpened version of it. Compared with the run at 0.01, the model makes less deterministic and more varied choices for the next characters; in this run, for example, the word 'be' appears inside the word 'bear'.

Therefore, a higher temperature can result in more creative and diverse text but may sacrifice coherence and structure. Choosing the temperature lets you control the trade-off between randomness and determinism in the generated text.

Experimenting with different temperature settings can provide insights into the behavior of the model and help achieve the desired level of creativity or predictability in the generated text.

In [None]:
print(extend_text("To be or not to be", temperature=100)) # High temperature leads to more diverse and random predictions.

## Findings
The seed text is still "To be or not to be", but the temperature parameter is set to 100, a significantly higher value. A drastically higher temperature leads to extremely random predictions.

### Output Analysis:
With a very high temperature, the function divides the log-probabilities by a large number, which makes the distribution almost uniform and introduces far more randomness. The result is incomprehensible to human readers: 'To be or not to bef ,mt'&t3fpady-$
wh!nse?pws&ertj-vgerdq,!c-yje,znq', output that includes randomised punctuation and symbols.

This emphasises that a high temperature sacrifices coherence and structure. With such an extreme temperature value, the balance between randomness and determinism in the generated text is lost.

# Reflection

Firstly, while building and training the model, the error 'INVALID_ARGUMENT: You must feed a value for placeholder tensor '...' with dtype int32' occurs multiple times. The main theory for the delayed operation is a model saving/loading warning: functions left untraced during model saving might not be traceable after loading. Equally, errors could occur due to an input shape mismatch or a placeholder name mismatch.

Secondly, generating text faced model architecture issues. Custom layers or operations need to be defined correctly, and placeholders may need explicit initialization. Ultimately, I believe its inability to function could be a product of an input data mismatch.

The input data provided to the TensorFlow model during training or inference must match the expected input shape and data type. This could be improved by inspecting the model's structure to identify the placeholder tensors mentioned in the error message (gradients/split_2_grad/concat/split_2/split_dim, gradients/split_grad/concat/split/split_dim, gradients/split_1_grad/concat/split_1/split_dim), in order to verify that these tensors are properly defined in this model and that their names match what the model is expecting.