# Language Modeling

## Statistical Language Modeling

### Statistical Language Modeling, or Language Modeling and LM for short, is the development of probabilistic models that are able to predict the next word in the sequence given the words that precede it.


### The neural network approach to language modeling can be described using the three following model properties, taken from A Neural Probabilistic Language Model, 2003.

* Associate each word in the vocabulary with a distributed word feature vector.
* Express the joint probability function of word sequences in terms of the feature vectors of these words in the sequence.
* Learn simultaneously the word feature vector and the parameters of the probability function.


## How to Develop a Character-Based Neural Language Model

In [1]:
filename = '/Users/test/Documents/Software-projects/Python Projects/Deep-Learning-Projects/Deep-Learning-Overfitting-Cook-Book/data/rhyme.txt'

In [2]:

# load doc into memory
def load_doc(filename):
  # open the file as read only 
  file = open(filename, 'r')
  # read all text
  text = file.read()
  # close the file 
  file.close()
  return text

# save tokens to file, one dialog per line
def save_doc(lines, filename): 
  data = '\n'.join(lines)
  file = open(filename, 'w')
  file.write(data)
  file.close()

# load text
raw_text = load_doc(filename) 
print(raw_text)
# clean
tokens = raw_text.split() 
raw_text = ' '.join(tokens)
# organize into sequences of characters
length = 10
sequences = list()
for i in range(length, len(raw_text)):
  # select sequence of tokens
  seq = raw_text[i-length:i+1]
  # store
  sequences.append(seq)
print('Total Sequences: %d' % len(sequences)) # save sequences to file
out_filename = 'char_sequences.txt' 
save_doc(sequences, out_filename)


Sing a song of sixpence,
A pocket full of rye.
Four and twenty blackbirds,
Baked in a pie.
When the pie was opened The birds began to sing; Wasn't that a dainty dish, To set before the king.
The king was in his counting house,
Counting out his money;
The queen was in the parlour,
Eating bread and honey.
The maid was in the garden,
Hanging out the clothes,
When down came a blackbird
And pecked off her nose.
Total Sequences: 399


## Train Language Model

In [3]:
from numpy import array
from pickle import dump
from keras.utils import to_categorical
from keras.utils.vis_utils import plot_model
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
# load doc into memory
def load_doc(filename):
  # open the file as read only 
  file = open(filename, 'r')
  # read all text
  text = file.read()
  # close the file 
  file.close()
  return text
# define the model
def define_model(X):
  model = Sequential()
  model.add(LSTM(75, input_shape=(X.shape[1], X.shape[2])))
  model.add(Dense(vocab_size, activation='softmax'))
  # compile model
  model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy']) # summarize defined model
  model.summary()
  plot_model(model, to_file='model.png', show_shapes=True)
  return model
# load
in_filename = 'char_sequences.txt'
raw_text = load_doc(in_filename)
lines = raw_text.split('\n')
# integer encode sequences of characters
chars = sorted(list(set(raw_text)))
mapping = dict((c, i) for i, c in enumerate(chars)) 
sequences = list()
for line in lines:
  # integer encode line
  encoded_seq = [mapping[char] for char in line]
  # store
  sequences.append(encoded_seq)
# vocabulary size
vocab_size = len(mapping)
print('Vocabulary Size: %d' % vocab_size)
# separate into input and output
sequences = array(sequences)
X, y = sequences[:,:-1], sequences[:,-1]
sequences = [to_categorical(x, num_classes=vocab_size) for x in X] 
X = array(sequences)
y = to_categorical(y, num_classes=vocab_size)
# define model
model = define_model(X)
# fit model
model.fit(X, y, epochs=100, verbose=2)
# save the model to file
model.save('model.h5')
# save the mapping
dump(mapping, open('mapping.pkl', 'wb'))

Vocabulary Size: 38
Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 lstm (LSTM)                 (None, 75)                34200     
                                                                 
 dense (Dense)               (None, 38)                2888      
                                                                 
Total params: 37,088
Trainable params: 37,088
Non-trainable params: 0
_________________________________________________________________
"dot" with args ['-Tps', '/var/folders/d2/2759wj910z3_bl8zjx2l83tr0000gp/T/tmpggkml91s'] returned code: 1

stdout, stderr:
 b''
b'Format: "ps" not recognized. No formats found.\nPerhaps "dot -c" needs to be run (with installer\'s privileges) to register the plugins?\n'



AssertionError: "dot" with args ['-Tps', '/var/folders/d2/2759wj910z3_bl8zjx2l83tr0000gp/T/tmpggkml91s'] returned code: 1

In [2]:
from keras.preprocessing.text import Tokenizer
data = """ Jack and Jill went up the hill\n
    To fetch a pail of water\n
     Jack fell down and broke his crown\n
And Jill came tumbling after\n """
    
tokenizer = Tokenizer() 
tokenizer.fit_on_texts([data])
# determine the vocabulary size
vocab_size = len(tokenizer.word_index) + 1 
print('Vocabulary Size: %d' % vocab_size) # create line-based sequences
sequences = list()
for line in data.split('\n'):
  encoded = tokenizer.texts_to_sequences([line])[0]
  for i in range(1, len(encoded)):
    sequence = encoded[:i+1]
    sequences.append(sequence)
print('Total Sequences: %d' % len(sequences))

Vocabulary Size: 22
Total Sequences: 21


In [3]:

from numpy import array
from keras.preprocessing.text import Tokenizer
from keras.utils import to_categorical
from keras_preprocessing.sequence import pad_sequences
from keras.utils.vis_utils import plot_model
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from keras.layers import Embedding
# generate a sequence from a language model
def generate_seq(model, tokenizer, max_length, seed_text, n_words):
  in_text = seed_text
  # generate a fixed number of words
  for _ in range(n_words):
    # encode the text as integer
    encoded = tokenizer.texts_to_sequences([in_text])[0]
    # pre-pad sequences to a fixed length
    encoded = pad_sequences([encoded], maxlen=max_length, padding='pre') # predict probabilities for each word
    yhat = model.predict_classes(encoded, verbose=0)
  # map predicted word index to word
    out_word = ''
    for word, index in tokenizer.word_index.items():
        if index == yhat:
          out_word = word
          break
        # append to input
    in_text += ' ' + out_word 
  return in_text


# define the model
def define_model(vocab_size, max_length):
  model = Sequential()
  model.add(Embedding(vocab_size, 10, input_length=max_length-1))
  model.add(LSTM(50))
  model.add(Dense(vocab_size, activation='softmax'))
  # compile network
  model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy']) # summarize defined model
  model.summary()
  return model


# source text
data = """ Jack and Jill went up the hill\n
    To fetch a pail of water\n
    Jack fell down and broke his crown\n
And Jill came tumbling after\n """

tokenizer = Tokenizer() 
tokenizer.fit_on_texts([data])
# determine the vocabulary size
vocab_size = len(tokenizer.word_index) + 1 
print('Vocabulary Size: %d' % vocab_size) # create line-based sequences
sequences = list()
for line in data.split('\n'):
  encoded = tokenizer.texts_to_sequences([line])[0]
  for i in range(1, len(encoded)):
    sequence = encoded[:i+1]
sequences.append(sequence)
print('Total Sequences: %d' % len(sequences))
# pad input sequences
max_length = max([len(seq) for seq in sequences])
sequences = pad_sequences(sequences, maxlen=max_length, padding='pre') 
print('Max Sequence Length: %d' % max_length)
# split into input and output elements
sequences = array(sequences)
X, y = sequences[:,:-1],sequences[:,-1]
y = to_categorical(y, num_classes=vocab_size)
# define model
model = define_model(vocab_size, max_length)
# fit network
model.fit(X, y, epochs=500, verbose=2)
# evaluate model
print(generate_seq(model, tokenizer, max_length-1, 'Jack', 4)) 
print(generate_seq(model, tokenizer, max_length-1, 'Jill', 4))

Vocabulary Size: 22
Total Sequences: 1
Max Sequence Length: 5
Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_1 (Embedding)     (None, 4, 10)             220       
                                                                 
 lstm_1 (LSTM)               (None, 50)                12200     
                                                                 
 dense_1 (Dense)             (None, 22)                1122      
                                                                 
Total params: 13,542
Trainable params: 13,542
Non-trainable params: 0
_________________________________________________________________
Epoch 1/500


2024-04-20 08:14:17.217552: W tensorflow/core/platform/profile_utils/cpu_utils.cc:128] Failed to get CPU frequency: 0 Hz


1/1 - 1s - loss: 3.0914 - accuracy: 0.0000e+00 - 745ms/epoch - 745ms/step
Epoch 2/500
1/1 - 0s - loss: 3.0832 - accuracy: 1.0000 - 3ms/epoch - 3ms/step
Epoch 3/500
1/1 - 0s - loss: 3.0749 - accuracy: 1.0000 - 2ms/epoch - 2ms/step
Epoch 4/500
1/1 - 0s - loss: 3.0664 - accuracy: 1.0000 - 2ms/epoch - 2ms/step
Epoch 5/500
1/1 - 0s - loss: 3.0577 - accuracy: 1.0000 - 2ms/epoch - 2ms/step
Epoch 6/500
1/1 - 0s - loss: 3.0485 - accuracy: 1.0000 - 3ms/epoch - 3ms/step
Epoch 7/500
1/1 - 0s - loss: 3.0390 - accuracy: 1.0000 - 3ms/epoch - 3ms/step
Epoch 8/500
1/1 - 0s - loss: 3.0289 - accuracy: 1.0000 - 2ms/epoch - 2ms/step
Epoch 9/500
1/1 - 0s - loss: 3.0183 - accuracy: 1.0000 - 2ms/epoch - 2ms/step
Epoch 10/500
1/1 - 0s - loss: 3.0070 - accuracy: 1.0000 - 2ms/epoch - 2ms/step
Epoch 11/500
1/1 - 0s - loss: 2.9950 - accuracy: 1.0000 - 2ms/epoch - 2ms/step
Epoch 12/500
1/1 - 0s - loss: 2.9823 - accuracy: 1.0000 - 3ms/epoch - 3ms/step
Epoch 13/500
1/1 - 0s - loss: 2.9687 - accuracy: 1.0000 - 2ms/epo

AttributeError: 'Sequential' object has no attribute 'predict_classes'