## Part 1

In [1]:
# SIMPLE FUNCTION THAT CAN READ FILES
def read_file(filepath):
    with open(filepath) as f:
        str_text = f.read()
    
    return str_text

In [2]:
#read_file('moby_dick_four_chapters.txt')

In [3]:
# TOKENIZE AND CLEAN THE TEXT
import spacy 

# We only need Spacy for Tokenization
#nlp = spacy.load('en', disable = ['parser', 'tagger', 'ner'])
nlp = spacy.load('en_core_web_sm', disable = ['parser', 'tagger', 'ner'])

# Here we only use up to chapter 4 of moby dick (not all)
nlp.max_length = 1198623

So the idea is I'm going to read in this entire thing as a string, pass it in to NLP and then iterate through the tokens grabbing their text and then lowercasing it. <br>
However, I wanna get rid of things that are probably not gonna be very helpful for training purposes, like periods or new lines, because they show up so often in this actual text,
especially new lines, that I wanna make sure my text generation neural network doesn't overfit to that sort of punctuation. <br>
Otherwise, you may just get a bunch of periods or a bunch of new lines at the end since those are common enough that the neural network overfits to them. <br>
We're really interested in the relationship between words

Common Punctuation provided by CARIS

In [4]:
def seperate_punc(doc_text):
    return [token.text.lower() for token in nlp(doc_text) if token.text not in '\n\n \n\n\n!"-#$%&()--.*+,-/:;<=>?@[\\]^_`{|}~\t\n ']

In [5]:
d = read_file('moby_dick_four_chapters.txt')

In [6]:
tokens = seperate_punc(d)



In [7]:
#tokens
len(tokens)

11338

Next, what we're going to do is create sequence of tokens. <br>
Basically Running the Model with 24 Words of a Sentence, and trying to predict the 25th Word. <br>
The Number of words that you want are dependent on the Document you want to Predict. <br>
You want the model to to grab the structure of a sentence but not short enough where you're missing general context. <br>
For example Song Lyrics could be shorter, and Shakespeare could be more than 50 words <br>

In [8]:
# 25 Words --> Network Predict # 26

train_len = 25 + 1

text_sequences = []

for i in range(train_len, len(tokens)):
    seq = tokens[i - train_len:i]   # i - train_len all the way to i

    text_sequences.append(seq)

In [9]:
type(text_sequences)

list

In [10]:
' '.join(text_sequences[0])

'call me ishmael some years ago never mind how long precisely having little or no money in my purse and nothing particular to interest me on'

In [11]:
' '.join(text_sequences[1])

'me ishmael some years ago never mind how long precisely having little or no money in my purse and nothing particular to interest me on shore'

In [12]:
' '.join(text_sequences[2])

'ishmael some years ago never mind how long precisely having little or no money in my purse and nothing particular to interest me on shore i'

### Tokenizer
Replace Text to Unique Numbers / ID

In [13]:
from keras.preprocessing.text import Tokenizer

In [14]:
tokenizer = Tokenizer()
# Lots of parameters that you can put in Tokenizer

tokenizer.fit_on_texts(text_sequences)

In [15]:
sequences = tokenizer.texts_to_sequences(text_sequences)

In [16]:
# Replace Text to Unique Numbers / ID
#sequences[0]
#sequences[1]

In [17]:
#tokenizer.index_word

In [18]:
#for i in sequences[0]:
#    print(f"{i} : {tokenizer.index_word[i]}")

In [19]:
# tokenizer.word_counts

In [20]:
vocabulary_size = len(tokenizer.word_counts)
vocabulary_size

2718

Right now the Type of Sequences is a list where every item in the list is another list of these actual numbers. <br>
What I'd like to do is format that to be a numPy matrix.

In [21]:
type(sequences)

list

In [22]:
import numpy as np

sequences = np.array(sequences)

Each of these rows represents a single line in the text. <br>
Notice how we're essentially shifting one word over (956 14 263) the next one (14, 263, 51) <br>
<br>
So given these ID numbers for each word what is the expected word to come after those first 25 words?<br>
So we already have our features here as well as our label and later on we'll be performing a train test splitwith that functionality<br>
which is why we kind of needed it in this numPy array.<br>
<br>
Basically 25 + 1 = 26 <br>
25 are the Features, 1 Label <br>
The 25 Are being Trained to Predict the 1 Label <br>

In [23]:
sequences

array([[ 956,   14,  263, ..., 2713,   14,   24],
       [  14,  263,   51, ...,   14,   24,  957],
       [ 263,   51,  261, ...,   24,  957,    5],
       ...,
       [ 952,   12,  166, ...,  262,   53,    2],
       [  12,  166, 2712, ...,   53,    2, 2718],
       [ 166, 2712,    3, ...,    2, 2718,   26]])

## Part 2

### Features Label Split <br>
because there's nothing to test against. <br>
There's kind of no right answer as far as what text generated should look like instead, we are really just texting or testing these features against the predicted label.

In [24]:
from keras.utils import to_categorical

In [25]:
# FEATURES
# Grab for every row, all the columns EXPECT the very last column
X = sequences[:, :-1]   # row, columns

X

array([[ 956,   14,  263, ...,    6, 2713,   14],
       [  14,  263,   51, ..., 2713,   14,   24],
       [ 263,   51,  261, ...,   14,   24,  957],
       ...,
       [ 952,   12,  166, ...,   11,  262,   53],
       [  12,  166, 2712, ...,  262,   53,    2],
       [ 166, 2712,    3, ...,   53,    2, 2718]])

In [26]:
# LABELS
# Grab all the rows AND JUST grabbed the last column
y = sequences[: , -1]   # ro, columns

y

array([  24,  957,    5, ...,    2, 2718,   26])

In [27]:
y = to_categorical(y, num_classes = vocabulary_size + 1)    # The way Padding works, it need extra one to hold a zero

Sequences_Length <br> 
11,368 Sequences. Those are essentially the shifted 25 words sentences <br>
And in Each Sentences there are 25 Words

In [28]:
seq_len = X.shape[1]

X.shape

(11312, 25)

### Create The Models

we'll be importing a Dense Layer and LSTM Layer to deal with the sequences <br>
and an Embedding Layer to deal with the vocabulary <br>

In [29]:
from keras.models import Sequential
from keras.layers import Dense, LSTM, Embedding

In [30]:
def create_model(vocabulary_size, seq_len):

    model = Sequential()

    # EMBEDDING (See Notebook)
    # Check Parameter Desc -> (input_dim, output_dim, embeddings_initializer)
    model.add(Embedding(vocabulary_size, seq_len, input_length = seq_len))

    # LSTM Layers
    # For Neurons usually it is best to have 2x of Sequence Length 
    model.add(LSTM(50, return_sequences = True))
    model.add(LSTM(50))

    # Dense Layer
    # Here 50 Neuron is just to Make the Training Quicker, real Training should be HIGHER
    model.add(Dense(50, activation = 'relu'))

    model.add(Dense(vocabulary_size, activation = 'softmax'))

    model.compile(loss = 'categorical_crossentropy', optimizer = 'adam', metrics = ['accuracy'])

    model.summary()

    return model

In [31]:
# +1 is to hold the extra zero
model = create_model(vocabulary_size + 1, seq_len)

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 25, 25)            67975     
                                                                 
 lstm (LSTM)                 (None, 25, 50)            15200     
                                                                 
 lstm_1 (LSTM)               (None, 50)                20200     
                                                                 
 dense (Dense)               (None, 50)                2550      
                                                                 
 dense_1 (Dense)             (None, 2719)              138669    
                                                                 
Total params: 244,594
Trainable params: 244,594
Non-trainable params: 0
_________________________________________________________________


### Train and Fit Our Models

In [32]:
# Save and Load the File Later on
from pickle import dump, load

The batch size is how many sequences you want to pass in at a time. You don't want to pass in everything at a time, <br>
otherwise the the neural network won't be able to handle that so you only want to pass in a certain amount of sequences, <br>
now 128 was a value that I kind of just chose arbitrarily and it worked well for me. <br>
<br>
Two epoch is not going to be nearly enough to generate any text that makes sense, so you may just see a bunch of the most common words <br>
like the, the,  the repeated but go ahead and train it on a little bit of epoch just so you can tell that it worked or not <br>
You should probably be training for at least like 200 epoch to get something that's reasonable. <br>
<br>
and then verbose one, it's just going to be the output report. <br>

In [33]:
model.fit(X, y, batch_size = 128, epochs = 2, verbose = 1)

Epoch 1/2
Epoch 2/2


<keras.callbacks.History at 0x1e02d356fb0>

Again, our accuracy is absolutely horrible which makes sense. <br>
Right now it's probably just predicting the word to be the most common word <br>
but once this is done training,especially if you trained for a really long time, <br>
you're going to want to save this. <br>
<br>
the other thing we're actually going to save is the tokenizer as well <br>
Remember that tokenizer has information across the entire vocabulary like word counts, <br>

In [34]:
# SAVING THE MODEL
model.save('my_mobydick_model.h5')

In [35]:
# SAVING THE TOKENIZER
dump(tokenizer, open('my_simpletokenizer', 'wb'))

## Part 3
What's next is to actually generate new text. <br>
So what we're gonna do is we're going to create a function that generates new text for us based off a given model, <br>
tokenizer, sequence length, a seed text and then the number of words to be generated by the model. <br>

So we actually need to feed it some sort of line of 25 tokens that we wanna start off with and then it's gonna generate one word after that. <br>
Then we're what gonna do is chop off the very first word of the seed, take in our new word, put it at the end, <br>
and then we have our new ctext or our new input text after that. <br>
And then we're gonna keep doing that however many times the user wants to generate words. <br>
<br>

In [36]:
from keras_preprocessing.sequence import pad_sequences

And then we're gonna pass in our input text. And we're gonna grab the first item here because it basically returns a tuple or a list. <br>
<br>
we're gonna pad sequences to our trained rate. since we only trained on 25 tokens, we're gonna pad it to make sure it's only 25 tokens. <br>
Or if your ctext happens to be too short, then we're gonna pad it to fill up the 25 spaces.
<br>

In [41]:
def generate_text(model, tokenizer, seq_len, seed_text, num_gen_words):

    output_text = []

    input_text = seed_text

    for i in range(num_gen_words):

        encoded_text = tokenizer.texts_to_sequences([input_text])[0]

        # Read pad_sequences Parameters Description
        pad_encoded = pad_sequences([encoded_text], maxlen = seq_len, truncating = 'pre')

        # Predict Class Probabilities for Each Word
        pred_word_ind = model.predict_classes(pad_encoded, verbose = 0)[0]

        pred_word = tokenizer.index_word[pred_word_ind]

        input_text += ' '+pred_word

        output_text.append(pred_word)

    return ' '.join(output_text)

So what is our generate text function doing? <br>
It's taking in the Model we just trained, the Tokenizer, which has knowledge about the vocabulary and what ID number goes with what word, <br>
the Sequence Length, some Seed Text you wanna start off with. And then the Number Of Words we wanna generate.<br>
<br>
And then let's say I wanna generate 10 words. So for I in range number of words, so I'm gonna do this 10 times, wanna generate 10 words <br>
I'm going to first take the input text string and encode it to be a sequence. <br>
Essentially, what we did earlier, we transformed those raw text data into sequences of numbers. <br>
Then if my ctext happens to be too short or too long, I may need to pad it, I may need to cut it off or I may need to add to it. (pad_encoded Function) <br>
<br>
After that, I'm going to predict the class probabilities for each word. So a model that predict classes is essentially going to throughout the entire vocabulary, <br>
assign a probability to the most likely next word. <br>
Next, we're gonna have the actual predicted word. So the way predict classes works when we index it with a zero, it's gonna return the index of that particular word. <br>
Essentially its ID which if we can call tokenizer index word from before, we just pass in that index and it matches with the actual word. <br>
<br>
Then we're gonna take in the input text and I'm going to add a space and then add on that predicted word. So if my input text in the very beginning was 25 words, <br>
after running this here for the first loop or the first pass on this for loop, it's now gonna be 26 words. Which means I'm then going to pad it. And that's why I'm gonna truncate with pre here. So it chops off the very first word. <br>
So essentially creating sequences as it goes along, but more and more, the sequenceis gonna be my predicted words. And if you make number generated words long enough <br>
eventually, you'll just be predicting on your own predicted words. <br>
<br>
So true generation without even any seed. Well, there is always a seed but after you do this enough times, if your number of generated words is longer than your ctext number of words, 
then you'll be predicting off your predicted words. <br>
Now we still wanna actually append that predicted word. So we'll say the output text and we'll append the predicted word. <br>
So this input text is for prediction purposes. This output text is all I'm actually gonna show. <br>

In [None]:
import random 
random.seed(101)
random_pick = random.randint(0,len(text_sequences))

random_seed_text = text_sequences[random_pick]

#random_seed_text

In [39]:
seed_text = ' '.join(random_seed_text)
seed_text

"thought i to myself the man 's a human being just as i am he has just as much reason to fear me as i have"

In [42]:
generate_text(model, tokenizer, seq_len, seed_text = seed_text, num_gen_words = 25)

AttributeError: 'Sequential' object has no attribute 'predict_classes'