### Introduction to Text Generation
This notebook explains how to split a text corpus into features and labels so that a neural network can be trained to predict the next word in a sentence. The data preparation follows these steps:

1. Create a corpus: break the text down into a list of sentences.
2. Create a word_index (vocabulary) from the text.
3. Tokenize the data and create n-gram sequences for each sentence of the corpus.
4. Pad those sequences to a common length.
5. Separate features from labels by reserving the last element of each padded sequence as the label.

In [1]:
## import the required libraries and APIs
import numpy as np
import tensorflow as tf
import tensorflow_datasets as tfds
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

print(tf.__version__)




2.15.0


### Step 1: Create a corpus

In [2]:
data = "ince we set return_sequences=True in the LSTM layers, the output is now a three-dimension vector. If we input that into the Dense layer, it will raise an error because the Dense layer only accepts two-dimension input. In order to input a three-dimension vector, we need to use a wrapper layer called TimeDistributed. This layer will help us maintain output’s shape, so that we can achieve a sequence as output in the end."

In [4]:
## instantiate tokenizer
tokenizer = Tokenizer()

## create corpus by lowercasing the text and splitting it into sentences on "."
corpus = data.lower().split(".")
print(corpus)

['ince we set return_sequences=true in the lstm layers, the output is now a three-dimension vector', ' if we input that into the dense layer, it will raise an error because the dense layer only accepts two-dimension input', ' in order to input a three-dimension vector, we need to use a wrapper layer called timedistributed', ' this layer will help us maintain output’s shape, so that we can achieve a sequence as output in the end', '']
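
The printed corpus shows two side effects of splitting on ".": most sentences keep a leading space, and the list ends with an empty string. Neither breaks the later steps, since the tokenizer ignores surrounding whitespace and an empty string yields no tokens, but a cleaner corpus could be built along the lines of the sketch below. The `sample` text and the variable names are illustrative assumptions, not part of the notebook.

```python
# Illustrative text; any string that uses "." as the sentence separator works.
sample = "We set the flag. The output has three dimensions. We wrap the layer."

# Lowercase, split on ".", strip surrounding whitespace, and drop empty pieces.
clean_corpus = [s.strip() for s in sample.lower().split(".") if s.strip()]

print(clean_corpus)
```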


### Step 2: Train the tokenizer and create word encoding dictionary

In [8]:
tokenizer.fit_on_texts(corpus)

# vocabulary size = number of words + 1, because index 0 is reserved for padding
vocab_size = len(tokenizer.word_index) + 1

print(tokenizer.word_index)
print(vocab_size)

{'the': 1, 'we': 2, 'a': 3, 'layer': 4, 'in': 5, 'dimension': 6, 'input': 7, 'output': 8, 'three': 9, 'vector': 10, 'that': 11, 'dense': 12, 'will': 13, 'to': 14, 'ince': 15, 'set': 16, 'return': 17, 'sequences': 18, 'true': 19, 'lstm': 20, 'layers': 21, 'is': 22, 'now': 23, 'if': 24, 'into': 25, 'it': 26, 'raise': 27, 'an': 28, 'error': 29, 'because': 30, 'only': 31, 'accepts': 32, 'two': 33, 'order': 34, 'need': 35, 'use': 36, 'wrapper': 37, 'called': 38, 'timedistributed': 39, 'this': 40, 'help': 41, 'us': 42, 'maintain': 43, 'output’s': 44, 'shape': 45, 'so': 46, 'can': 47, 'achieve': 48, 'sequence': 49, 'as': 50, 'end': 51}
52
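
The word indices above start at 1, with the most frequent words first, which is why the vocabulary size is one more than the number of words: index 0 stays free for padding. The sketch below imitates that numbering in plain Python. `build_word_index` is a hypothetical helper written for illustration; it is a simplified stand-in rather than the Keras `Tokenizer` implementation (it does no punctuation filtering, for example).

```python
from collections import Counter

def build_word_index(sentences):
    """Map each word to an integer id, most frequent first, starting at 1."""
    counts = Counter(word for s in sentences for word in s.split())
    # most_common() orders by descending count; id 0 is never assigned.
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

toy_corpus = ["the layer wraps the output", "the output is a sequence"]
toy_index = build_word_index(toy_corpus)
toy_vocab_size = len(toy_index) + 1  # + 1 because id 0 is reserved for padding

print(toy_index)
print(toy_vocab_size)
```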


### Step 3: Create n-gram sequences

In [9]:
## create n-gram sequences of each text sequence
input_sequences = []
for line in corpus:
    tokens = tokenizer.texts_to_sequences([line])[0]  # token ids of this sentence
    for i in range(1, len(tokens)):  # every prefix with at least two tokens
        n_gram_sequence = tokens[:i + 1]
        input_sequences.append(n_gram_sequence)
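
To make the loop concrete, the sketch below applies the same prefix-building logic to a single made-up token list. The ids are illustrative and are not taken from the word index printed earlier.

```python
# Hypothetical token ids for one four-word sentence.
tokens = [2, 16, 17, 18]

# Each prefix of length 2 or more becomes one training sequence.
prefixes = [tokens[:i + 1] for i in range(1, len(tokens))]

for p in prefixes:
    print(p)
```

A sentence with n tokens therefore yields n - 1 sequences, and the last element of each sequence is the word the model will later be asked to predict.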

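### Step 4: Pad the sequences

In [None]:
## pad every sequence with leading zeros up to the length of the longest one
max_seq_len = max(len(seq) for seq in input_sequences)
## pad_sequences already returns a NumPy array
input_seq_array = pad_sequences(input_sequences, maxlen=max_seq_len, padding='pre')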
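
Step 5 of the outline, reserving the last element of each padded sequence as the label, can be sketched as follows. This is an illustration with made-up token ids and plain Python lists, not the notebook's own code; `pre_pad` is a hypothetical helper that imitates padding with leading zeros.

```python
def pre_pad(seq, length, value=0):
    """Left-pad seq with value up to length (hypothetical helper)."""
    return [value] * (length - len(seq)) + list(seq)

# Made-up n-gram sequences of different lengths.
sequences = [[2, 16], [2, 16, 17], [2, 16, 17, 18]]
max_len = max(len(s) for s in sequences)
padded = [pre_pad(s, max_len) for s in sequences]

# Everything except the last element is the feature vector;
# the last element is the label (the id of the next word).
features = [row[:-1] for row in padded]
labels = [row[-1] for row in padded]

print(features)
print(labels)
```

Padding at the front keeps the label in the last column of every row, so on a 2-D NumPy array `arr` the same split is `arr[:, :-1]` for the features and `arr[:, -1]` for the labels.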