<a href="https://colab.research.google.com/github/MohebZandi/Deep_Learning_NLP/blob/main/2_Book_Deep_Learning_NLP_Jason_Brownlee_2020_Foundation_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Book_Deep_Learning_NLP_Jason Brownlee_2020**

# Section Two: 
Jason Brownlee
2020

# Part VII Language Modeling

**Neural Language Modeling**

Language modeling is central to many important natural language processing tasks. 

Recently, neural-network-based language models have demonstrated better performance than classical methods both standalone and as part of more challenging natural language processing tasks. 

In this chapter, you will discover language modeling for natural language processing.

This tutorial is divided into the following parts:
1. Problem of Modeling Language
2. Statistical Language Modeling
3. Neural Language Models

**Problem of Modeling Language**

Formal languages, like programming languages, can be fully specified. 

All the reserved words can be defined and the valid ways that they can be used can be precisely defined. 

We cannot do this with natural language. Natural languages are not designed; they emerge, and therefore there is no formal specification.

**Statistical Language Modeling**

Statistical Language Modeling, or Language Modeling and LM for short, is the development of probabilistic models that are able to predict the next word in the sequence given the words that precede it.

A language model learns the probability of word occurrence based on examples of text.

Simpler models may look at a context of a short sequence of words, whereas larger models may work at the level of sentences or paragraphs. 

Most commonly, language models operate at the level of words.

A language model is a function that puts a probability measure over strings drawn from some vocabulary.

**Language modeling is a crucial component in real-world applications such as machine-translation and automatic speech recognition.**

For these reasons, language modeling plays a central role in natural-language processing, AI, and machine learning research.

A good example is speech recognition, where audio data is used as an input to the model and the output requires a language model that interprets the input signal and recognizes each new word within the context of the words already recognized.

Similarly, language models are used to generate text in many similar natural language processing tasks, for example:
- Optical Character Recognition
- Handwriting Recognition.
- Machine Translation.
- Spelling Correction.
- Image Captioning.
- Text Summarization
- And much more.

**Neural Language Models**

Recently, the use of neural networks in the development of language models has become very popular, to the point that it may now be the preferred approach.

The use of neural networks in language modeling is often called **Neural Language Modeling**, or **NLM** for short.



**How to Develop a Character-Based Neural Language Model**

A language model predicts the next word in the sequence based on the specific words that have come before it in the sequence. 

It is also possible to develop language models at the character
level using neural networks. 

The benefit of character-based language models is their small vocabulary and 
exibility in handling any words, punctuation, and other document structure.

In this tutorial, you will discover how to
develop a character-based neural language model. 

After completing this tutorial, you will know:
- How to prepare text for character-based language modeling.
- How to develop a character-based language model using LSTMs.
- How to use a trained character-based language model to generate text.

This tutorial is divided into the following parts:
1. Sing a Song of Sixpence
2. Data Preparation
3. Train Language Model
4. Generate Text

**Sing a Song of Sixpence**

Sing a song of sixpence,

A pocket full of rye.

Four and twenty blackbirds,

Baked in a pie.

When the pie was opened

The birds began to sing;

Wasn't that a dainty dish,

To set before the king.

The king was in his counting house,

Counting out his money;

The queen was in the parlour,

Eating bread and honey.

The maid was in the garden,

Hanging out the clothes,

When down came a blackbird

And pecked off her nose.

**Data Preparation**

The first step is to prepare the text data. We will start by defining the type of language model.

A language model must be trained on the text, and in the case of a character-based language model, the input and output sequences must be characters. 

The number of characters used as input will also define the number of characters that will need to be provided to the model in order to elicit the first predicted character. 

After the first character has been generated, it can be appended to the input sequence and used as input for the model to generate the next character.

**Load Text**

function named *load doc()*

raw_text = load_doc('rhyme.txt')

print(raw_text)

**Clean Text**

Next, we need to clean the loaded text. We will not do much to it on this example. Speciffically, we will strip all of the new line characters so that we have one long sequence of characters separated only by white space.

**Create Sequences**

Now that we have a long list of characters, we can create our input-output sequences used to train the model. 

Each input sequence will be 10 characters with one output character, making
each sequence 11 characters long. 

We can create the sequences by enumerating the characters in the text, starting at the 11th character at index 10.

**Save Sequences**

function *save_doc()*

out_filename = 'char_sequences.txt'

save_doc(sequences, out_filename)

**Complete Example**

Tying all of this together, the complete code listing is provided below.

In [3]:
# Because there are lots of files to be read in memory, I have uploaded the folders in 
# Google Drive, so I have to mount it first, then read the data

# In this method the Authentication of google drive will open a new window.
# There are also other ways to inform the Auth to google in program text


from google.colab import drive
drive.mount('/content/gdrive')     # Mounting Google Drive in Colab

Mounted at /content/gdrive


In [None]:
# load doc into memory
def load_doc(filename):
    # open the file as read only
    file = open(filename, 'r')
    # read all text
    text = file.read()
    # close the file
    file.close()
    return text
# save tokens to file, one dialog per line
def save_doc(lines, filename):
    data = '\n'.join(lines)
    file = open(filename, 'w')
    file.write(data)
    file.close()

# load text
raw_text = load_doc('/content/gdrive/My Drive/txt_sentoken//rhyme.txt')
print('raw_text:\n\n',raw_text)

# clean
tokens = raw_text.split()
raw_text = ' '.join(tokens)
# organize into sequences of characters
length = 10
sequences = list()

for i in range(length, len(raw_text)):
    # select sequence of tokens
    seq = raw_text[i-length:i+1]
    # store
    sequences.append(seq)

print('\n\nTotal Sequences: %d' % len(sequences))
# save sequences to file
out_filename = '/content/gdrive/My Drive/txt_sentoken/char_sequences.txt'
print('\n\nSample of Sequences:\n',sequences[:5])
save_doc(sequences, out_filename)

raw_text:

 Sing a song of sixpence,
A pocket full of rye.
Four and twenty blackbirds,
Baked in a pie.
When the pie was opened
The birds began to sing;
Wasn't that a dainty dish,
To set before the king.
The king was in his counting house,
Counting out his money;
The queen was in the parlour,
Eating bread and honey.
The maid was in the garden,
Hanging out the clothes,
When down came a blackbird
And pecked off her nose.


Total Sequences: 399


Sample of Sequences:
 ['Sing a song', 'ing a song ', 'ng a song o', 'g a song of', ' a song of ']


We are now ready to train our character-based neural language model.

**Load Data**

Calling *load_doc()*

in_filename = 'char_sequences.txt'

raw_text = load_doc(in_filename)

lines = raw_text.split('\n')

**Encode Sequences**

The sequences of characters must be encoded as integers. This means that each unique character will be assigned a specific integer value and each sequence of characters will be encoded as a sequence of integers. 

We can create the mapping given a sorted set of unique characters in the
raw input data. The mapping is a dictionary of character values to integer values.


**Split Inputs and Output**

Now that the sequences have been integer encoded, we can separate the columns into input and output sequences of characters. We can do this using a simple array slice.

sequences = array(sequences)

X, y = sequences[: , :-1], sequences[: , -1]

**Fit Model**

The model is defined with an input layer that takes sequences that have 10 time steps and 38 features for the one hot encoded input sequences. 

Rather than specify these numbers, we use the second and third dimensions on the X input data. 

This is so that if we change the length of the sequences or size of the vocabulary, we do not need to change the model definition. 

The model has a single LSTM hidden layer with 75 memory cells, chosen with a little trial and error. 

The model has a fully connected output layer that outputs one vector with a probability distribution across all characters in the vocabulary. 

A softmax activation function is used on the output layer to ensure the output has the properties of a probability distribution.

The model is learning a multiclass classiffication problem, therefore we use the categorical log loss intended for this type of problem. 

The efficient Adam implementation of gradient descent is used to optimize the model and accuracy is reported at the end of each batch update.

**Save Model**

model.save('model_2.h5')

**Complete Example**

In [None]:
# Encode Sequences

chars = sorted(list(set(raw_text)))

mapping = dict((c, i) for i, c in enumerate(chars))

sequences = list()
for line in lines:
    # integer encode line
    encoded_seq = [mapping[char] for char in line]
    # store
    sequences.append(encoded_seq)

In [None]:
from numpy import array
from pickle import dump
from tensorflow.keras.utils import to_categorical
from keras.utils.vis_utils import plot_model
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM

# load doc into memory
def load_doc(filename):
    # open the file as read only
    file = open(filename, 'r')
    # read all text
    text = file.read()
    # close the file
    file.close()
    return text

# define the model
def define_model(X):
    model = Sequential()
    model.add(LSTM(75, input_shape=(X.shape[1], X.shape[2])))
    model.add(Dense(vocab_size, activation='softmax'))
    # compile model
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    # summarize defined model
    model.summary()
    plot_model(model, to_file='model.png', show_shapes=True)
    return model

# load
in_filename = '/content/gdrive/My Drive/txt_sentoken/char_sequences.txt'
raw_text = load_doc(in_filename)
lines = raw_text.split('\n')

# integer encode sequences of characters
chars = sorted(list(set(raw_text)))
mapping = dict((c, i) for i, c in enumerate(chars))
sequences = list()
for line in lines:
    # integer encode line
    encoded_seq = [mapping[char] for char in line]
    # store
    sequences.append(encoded_seq)

# vocabulary size
vocab_size = len(mapping)
print('Vocabulary Size: %d' % vocab_size)

# separate into input and output
sequences = array(sequences)
X, y = sequences[:,:-1], sequences[:,-1]
sequences = [to_categorical(x, num_classes=vocab_size) for x in X]
X = array(sequences)
y = to_categorical(y, num_classes=vocab_size)

# define model
model = define_model(X)
model.pre
# fit model
model.fit(X, y, epochs=100, verbose=2)
# save the model to file
model.save('/content/gdrive/My Drive/txt_sentoken/model_2.h5')
# save the mapping
dump(mapping, open('/content/gdrive/My Drive/txt_sentoken/mapping.pkl', 'wb'))

Vocabulary Size: 38
Model: "sequential_2"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 lstm_2 (LSTM)               (None, 75)                34200     
                                                                 
 dense_2 (Dense)             (None, 38)                2888      
                                                                 
Total params: 37,088
Trainable params: 37,088
Non-trainable params: 0
_________________________________________________________________
Epoch 1/100
13/13 - 2s - loss: 3.6144 - accuracy: 0.0627 - 2s/epoch - 145ms/step
Epoch 2/100
13/13 - 0s - loss: 3.4974 - accuracy: 0.1654 - 82ms/epoch - 6ms/step
Epoch 3/100
13/13 - 0s - loss: 3.1485 - accuracy: 0.1905 - 86ms/epoch - 7ms/step
Epoch 4/100
13/13 - 0s - loss: 3.0611 - accuracy: 0.1905 - 99ms/epoch - 8ms/step
Epoch 5/100
13/13 - 0s - loss: 3.0089 - accuracy: 0.1905 - 100ms/epoch - 8ms/step
Epoch 6/100
13/13 - 

**Generate Text**

We will use the learned language model to generate new sequences of text that have the same statistical properties.

The first step is to load the model saved to the file model.h5. We can use the *load_model()* function from the Keras API.

We also need to load the pickled dictionary for mapping characters to integers from the file mapping.pkl. We will use the Pickle API to load the object.

**Complete Example**



In [7]:
from pickle import load
import numpy as np
from numpy import array
from keras.models import load_model
from tensorflow.keras.utils import to_categorical
from keras.preprocessing.sequence import pad_sequences

# generate a sequence of characters with a language model
def generate_seq(model, mapping, seq_length, seed_text, n_chars):
    in_text = seed_text
    # generate a fixed number of characters
    for _ in range(n_chars):
        # encode the characters as integers
        encoded = [mapping[char] for char in in_text]
        # truncate sequences to a fixed length
        encoded = pad_sequences([encoded], maxlen=seq_length, truncating='pre')
        # one hot encode
        encoded = to_categorical(encoded, num_classes=len(mapping))
        # predict character
        yhat = np.argmax(model.predict(encoded), axis= 1)

        # reverse map integer to character
        out_char = ''
        # print(yhat)
        # print()
        # print(classes_x)
        for char, index in mapping.items():
            if index == yhat:
                out_char = char
            break
        # append to input
        in_text += out_char
    return in_text
# load the model
model = load_model('/content/gdrive/My Drive/txt_sentoken/model_2.h5')
# load the mapping
mapping = load(open('/content/gdrive/My Drive/txt_sentoken/mapping.pkl', 'rb'))

# test start of rhyme
print(generate_seq(model, mapping, 10, 'Sing a son', 20))

# test mid-line
print(generate_seq(model, mapping, 10, 'king was i', 20))

# test not in original
print(generate_seq(model, mapping, 10, 'hello worl', 20))

Sing a son
king was i
hello worl


**How to Develop a Word-Based Neural Language Model**

Language modeling involves predicting the next word in a sequence given the sequence of words already present. 

A language model is a key element in many natural language processing models
such as machine translation and speech recognition.

This tutorial is divided into the following parts:
1. Framing Language Modeling
2. Jack and Jill Nursery Rhyme
3. Model 1: One-Word-In, One-Word-Out Sequences
4. Model 2: Line-by-Line Sequence
5. Model 3: Two-Words-In, One-Word-Out Sequence

**Framing Language Modeling**

Language models both learn and predict one word at a time. 

The training of the network involves providing sequences of words as input that are processed one at a time where a prediction can be made and learned for each input sequence.

There are many ways to frame the sequences from a source text for language modeling. 

In this tutorial, we will explore 3 dierent ways of developing word-based language models in the Keras deep learning library. 

There is no single best approach, just different framings that may suit different applications.

**Jack and Jill Nursery Rhyme**

We will use this as our source text for exploring dierent framings of a word-based language
model. We can define this text in Python as follows:



In [None]:
data = """ Jack and Jill went up the hill\n
    To fetch a pail of water\n
    Jack fell down and broke his crown\n
    And Jill came tumbling after\n """

**Model 1: One-Word-In, One-Word-Out Sequences**

We can start with a very simple model. Given one word as input, the model will learn to predict the next word in the sequence. 

For example:

In [None]:
 X,        y
Jack,     and
and,      Jill
Jill,     went

The first step is to encode the text as integers. 

Each lowercase word in the source text is assigned a unique integer and we can convert the sequences of words to sequences of integers.

Keras provides the Tokenizer class that can be used to perform this encoding. 

First, the *Tokenizer* is fit on the source text to develop the mapping from words to unique integers. 

Then sequences of text can be converted to sequences of integers by calling the *texts_to_sequences()* function.

We will need to know the size of the vocabulary later for both defining the word embedding layer in the model, and for encoding output words using a one hot encoding. 

The size of the vocabulary can be retrieved from the trained Tokenizer by accessing the *word_index* attribute.

In [None]:
# integer encode text
tokenizer = Tokenizer()
tokenizer.fit_on_texts([data])
encoded = tokenizer.texts_to_sequences([data])[0]

# determine the vocabulary size
vocab_size = len(tokenizer.word_index) + 1
print('Vocabulary Size: %d' % vocab_size)

Running this example, we can see that the size of the vocabulary is 21 words. 

We add one, because we will need to specify the integer for the largest encoded word as an array index, e.g. words encoded 1 to 21 with array indicies 0 to 21 or 22 positions. 

Next, we need to create
sequences of words to fit the model with one word as input and one word as output.

In [None]:
# create word -> word sequences
sequences = list()
for i in range(1, len(encoded)):
sequence = encoded[i-1:i+1]
sequences.append(sequence)
print('Total Sequences: %d' % len(sequences))

We can then split the sequences into input (X) and output elements (y). 

This is straightforward as we only have two columns in the data.

We are now ready to define the neural network model. The model uses a learned word embedding in the input layer. 

This has one real-valued vector for each word in the vocabulary,
where each word vector has a specified length. In this case we will use a 10-dimensional projection. 

The input sequence contains a single word, therefore the input length=1. The
model has a single hidden LSTM layer with 50 units. This is far more than is needed. 

The output layer is comprised of one neuron for each word in the vocabulary and uses a softmax activation function to ensure the output is normalized to look like a probability.

After the model is fit, we test it by passing it a given word from the vocabulary and having the model predict the next word. 

Here we pass in `Jack' by encoding it and calling
*model.predict_classes()* to get the integer output for the predicted word. 

This is then looked up in the vocabulary mapping to give the associated word.

We can tie all of this together. The complete code listing is provided below.

In [8]:
from numpy import array
from keras.preprocessing.text import Tokenizer
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.utils import plot_model
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from keras.layers import Embedding

# generate a sequence from the model
def generate_seq(model, tokenizer, seed_text, n_words):
    in_text, result = seed_text, seed_text
    # generate a fixed number of words
    for _ in range(n_words):
        # encode the text as integer
        encoded = tokenizer.texts_to_sequences([in_text])[0]
        encoded = array(encoded)
        # predict a word in the vocabulary
        yhat = np.argmax(model.predict(encoded), axis= 1)
        # map predicted word index to word
        out_word = ''
        for word, index in tokenizer.word_index.items():
            if index == yhat:
                out_word = word
                break
            # append to input
            in_text, result = out_word, result + ' ' + out_word
    return result

# define the model
def define_model(vocab_size):
    model = Sequential()
    model.add(Embedding(vocab_size, 10, input_length=1))
    model.add(LSTM(50))
    model.add(Dense(vocab_size, activation='softmax'))
    # compile network
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    # summarize defined model
    model.summary()
    plot_model(model, to_file='/content/gdrive/My Drive/txt_sentoken/model.png', show_shapes=True)
    return model

# source text
data = """ Jack and Jill went up the hill\n
    To fetch a pail of water\n
    Jack fell down and broke his crown\n
    And Jill came tumbling after\n """

# integer encode text
tokenizer = Tokenizer()
tokenizer.fit_on_texts([data])
encoded = tokenizer.texts_to_sequences([data])[0]
# determine the vocabulary size
vocab_size = len(tokenizer.word_index) + 1
print('Vocabulary Size: %d' % vocab_size)
# create word -> word sequences
sequences = list()
for i in range(1, len(encoded)):
    sequence = encoded[i-1:i+1]
    sequences.append(sequence)

print('Total Sequences: %d' % len(sequences))

# split into X and y elements
sequences = array(sequences)
X, y = sequences[:,0],sequences[:,1]

# one hot encode outputs
y = to_categorical(y, num_classes=vocab_size)

# define model
model = define_model(vocab_size)

# fit network
model.fit(X, y, epochs=500, verbose=2)

# evaluate
print(generate_seq(model, tokenizer, 'Jack', 6))

Vocabulary Size: 22
Total Sequences: 24
Model: "sequential_2"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_2 (Embedding)     (None, 1, 10)             220       
                                                                 
 lstm_2 (LSTM)               (None, 50)                12200     
                                                                 
 dense_2 (Dense)             (None, 22)                1122      
                                                                 
Total params: 13,542
Trainable params: 13,542
Non-trainable params: 0
_________________________________________________________________
Epoch 1/500
1/1 - 2s - loss: 3.0909 - accuracy: 0.0417 - 2s/epoch - 2s/step
Epoch 2/500
1/1 - 0s - loss: 3.0902 - accuracy: 0.0417 - 6ms/epoch - 6ms/step
Epoch 3/500
1/1 - 0s - loss: 3.0894 - accuracy: 0.1250 - 6ms/epoch - 6ms/step
Epoch 4/500
1/1 - 0s - loss: 3.0886 - accu