<a href="https://colab.research.google.com/github/Shahid1993/colab-notebooks/blob/master/word_completion_prediction_test.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Testing Previously Trained Models

### Load Model from Google Drive

In [1]:
# Mounting Google Drive to Load Data
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [0]:
import numpy as np
from keras.models import load_model
import pickle
import heapq

In [0]:
model = load_model('./drive/My Drive/ML/Models/word_completion_prediction/word_completion_prediction_keras_model.h5')
# Load the saved training history; the context manager closes the file afterwards
with open("./drive/My Drive/ML/Models/word_completion_prediction/word_completion_prediction_history.p", "rb") as f:
    history = pickle.load(f)

In [0]:
chars = ' !"\'(),-.0123456789:;?_abcdefghijklmnopqrstuvwxyz¦'
char_indices = dict((c, i) for i, c in enumerate(chars))
indices_char = dict((i, c) for i, c in enumerate(chars))

In [0]:
def prepare_input(text):
    x = np.zeros((1, len(text), len(chars)))
    for t, char in enumerate(text):
        x[0, t, char_indices[char]] = 1.
        
    return x
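
To make the encoding concrete, here is a minimal sketch of the same one-hot scheme on a toy alphabet. The names `toy_chars`, `toy_char_indices` and `one_hot_encode` are hypothetical and exist only for this illustration; the notebook itself uses the global `chars` and `char_indices`.

```python
import numpy as np

# Hypothetical toy alphabet; the notebook's real `chars` string is much longer.
toy_chars = ' abc'
toy_char_indices = {c: i for i, c in enumerate(toy_chars)}

def one_hot_encode(text, char_to_index):
    """Return a (1, len(text), alphabet_size) array with a single 1 per time step."""
    x = np.zeros((1, len(text), len(char_to_index)))
    for t, char in enumerate(text):
        x[0, t, char_to_index[char]] = 1.0
    return x

encoded = one_hot_encode('ab c', toy_char_indices)
print(encoded.shape)               # (1, 4, 4)
print(encoded[0].argmax(axis=1))   # position of each character in toy_chars
```

A character that is missing from the alphabet raises a `KeyError` in this scheme, which is why the test sentences are lowercased before being passed to the model.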

In [0]:
def sample(preds, top_n=3):
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds)
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    
    return heapq.nlargest(top_n, range(len(preds)), preds.take)
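
For strictly positive probabilities, taking the log, exponentiating and renormalising returns the same distribution (up to rounding), so `sample` effectively just ranks the predictions. A minimal sketch of that ranking step on made-up numbers, using a hypothetical helper name `top_n_indices`:

```python
import heapq
import numpy as np

def top_n_indices(probs, top_n=3):
    """Return the indices of the `top_n` largest probabilities, best first."""
    probs = np.asarray(probs, dtype='float64')
    return heapq.nlargest(top_n, range(len(probs)), key=probs.take)

# Made-up probabilities for a four-character alphabet.
toy_probs = [0.1, 0.6, 0.05, 0.25]
print(top_n_indices(toy_probs, top_n=2))  # [1, 3]
```

The log/exp pair would only change the result if a temperature were applied between the two steps, which this notebook does not do.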

In [0]:
def predict_completion(text):
    original_text = text
    generated = text
    completion = ''
    while True:
        x = prepare_input(text)
        preds = model.predict(x, verbose=0)[0]
        next_index = sample(preds, top_n=1)[0]
        next_char = indices_char[next_index]
        text = text[1:] + next_char
        completion += next_char
        
        # Stop once the model emits a space, i.e. at the end of the word
        if next_char == ' ':
            return completion
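
The loop above ends only when the model predicts a space, so it could run indefinitely if that never happens. The following is a sketch of the same greedy loop with a length cap. It is not the notebook's code: `greedy_completion`, `max_len`, `toy_rules` and `toy_predictor` are hypothetical names, and the predictor is a stand-in for `model.predict`.

```python
def greedy_completion(text, predict_next_char, max_len=20):
    """Greedily extend `text` one character at a time until a space or `max_len`."""
    completion = ''
    while len(completion) < max_len:
        next_char = predict_next_char(text)
        text = text[1:] + next_char  # slide the input window forward
        completion += next_char
        if next_char == ' ':
            break
    return completion

# Hypothetical stand-in for the model: picks the next character from the last one.
toy_rules = {'v': 'e', 'e': ' '}

def toy_predictor(text):
    return toy_rules.get(text[-1], 'a')

print(repr(greedy_completion('lov', toy_predictor)))             # 'e '
print(len(greedy_completion('xyz', lambda t: 'a', max_len=5)))   # 5
```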

In [0]:
def predict_completions(text, n=3):
    x = prepare_input(text)
    preds = model.predict(x, verbose=0)[0]
    next_indices = sample(preds, n)
    return [indices_char[idx] + predict_completion(text[1:] + indices_char[idx]) for idx in next_indices]

In [0]:
# actual_text = [
#     "It is not a lack of love, but a lack of friendship that makes unhappy marriages.",
#     "That which does not kill us makes us stronger.",
#     "I'm not upset that you lied to me, I'm upset that from now on I can't believe you.",
#     "And those who were seen dancing were thought to be insane by those who could not hear the music.",
#     "It is hard enough to remember my opinions, without also remembering my reasons for them!",
#     "A man lying on a comfortable sofa is listening to his wi",
#     "Assuming the predictions are probabilistic, novel sequences can be generated from a trai",
#     "The networks performance is competitive with state-of-the-art language models, and it works almost",
#     "This document is the initial part of a study to predict next words from a text dataset"
# ]

# Named test_inputs so the built-in input() is not shadowed
test_inputs = [
    "It is not a lack of lov",
    "That which does not kill us makes us stro",
    "I'm not upset that you lied to me, I'm upset that from now on I can't bel",
    "And those who were seen dan",
    "It is hard enough to remember my opini",
    "A man lying on a comfortable ch",
    "The networks perf",
    "The networks performance is competi",
    "The networks performance is competitive with state-of-the-art lan",
    "This document is the initial part of a study to pre",
    "This document is the initial part of a study to pred",
    "Assuming the prediction",
    "Assuming the predictions are probabilistic, novel sequences can be gene",
    "Assuming the predictions are probabilistic, novel sequences can be generat"
]

In [38]:
for sentence in test_inputs:
    seq = sentence.lower()
    print(seq)
    print(predict_completions(seq, 5))
    print()

it is not a lack of lov
['e ', 'ical ', 'ality ', 'oure ', 'uling ']

that which does not kill us makes us stro
['ng ', 'dger ', 've ', 'gget ', 'w ']

i'm not upset that you lied to me, i'm upset that from now on i can't bel
['ieve ', 'ong ', 'aes ', 'ess ', 'low ']

and those who were seen dan
['gerous ', 'king ', 'ders ', 'y ', 'ce ']

it is hard enough to remember my opini
['on ', 'an ', 'ty ', 'fic ', 's ']

a man lying on a comfortable ch
['ild ', 'aracteristic ', 'ristian ', 'erristic ', 'omes ']

the networks perf
['ectly ', 'ord ', 'aind, ', 'iced ', 'uch ']

the networks performance is competi
['tion ', 'ce, ', 'ences ', 'sion ', 'ons ']

the networks performance is competitive with state-of-the-art lan
['ger ', 'ds ', 'k ', 'ce ', 'ture ']

this document is the initial part of a study to pre
['sent ', 'dicate ', 'cisely ', 'vail ', 'juce ']

this document is the initial part of a study to pred
['icate ', 'ention ', 'ucate ', 'action ', 'ocation ']

assuming the prediction
['

# Corpus Preprocessing

In [79]:
#path = 'nietzsche.txt'

#path = "./drive/My Drive/ML/data/nietzsche.txt"

#path = "./drive/My Drive/ML/data/1-billion-word-language-modeling-benchmark-r13output/training-monolingual.tokenized.shuffled/news.en-00001-of-00100"

path = "./drive/My Drive/ML/data/word_pred.txt"

# Read the whole corpus and lowercase it; the file is closed on leaving the block
with open(path) as f:
    text = f.read().lower()
print('corpus length:', len(text))

corpus length: 11646654


In [80]:
chars = sorted(list(set(text)))
char_indices = dict((c, i) for i, c in enumerate(chars))
indices_char = dict((i, c) for i, c in enumerate(chars))

print(f'unique chars: {len(chars)}')

print(chars)

print(''.join(map(str, chars)))

unique chars: 71
['\n', ' ', '!', '"', '$', '%', '&', "'", '(', ')', '*', '+', ',', '-', '.', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', ':', ';', '<', '=', '?', '@', '[', ']', '_', '`', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', '¤', '¦', '©', '«', 'ã', 'ä', '’', '“', '”', '†']

 !"$%&'()*+,-.0123456789:;<=?@[]_`abcdefghijklmnopqrstuvwxyz¤¦©«ãä’“”†


In [0]:
def preprocess(data):
    punct = '\n#$<=>[\\]@^{|}~¡¢£¤¥©«¬®°²´µ¶·º»¼½¾¿×àáâãäåæçèéêëíîïñóôõöøùúüþąćĕěœšŵžʼ˚а‎‐‑‚‟†•′₤€∆④●♥ﬁ（）￡�'
    
    for p in punct:
        data = data.replace(p, '')
        
    return data
  
text = preprocess(text)
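
Calling `replace` once per character rescans the whole 11-million-character corpus each time. A possible alternative, sketched here on a toy string, is a single `str.translate` pass; the helper name `strip_chars` and the sample input are assumptions made for this illustration.

```python
def strip_chars(data, unwanted):
    """Delete every character listed in `unwanted` from `data` in one pass."""
    table = str.maketrans('', '', unwanted)
    return data.translate(table)

# Toy input; the notebook's `punct` string is much longer.
print(strip_chars('a<b>\n[c]', '\n<>[]'))  # abc
```

Either way, deleting `'\n'` joins the last character of one line directly to the first character of the next, without a space in between.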

In [82]:
chars = sorted(list(set(text)))
char_indices = dict((c, i) for i, c in enumerate(chars))
indices_char = dict((i, c) for i, c in enumerate(chars))

print(f'unique chars: {len(chars)}')

print(chars)

print(''.join(map(str, chars)))

print('corpus length:', len(text))

unique chars: 58
[' ', '!', '"', '%', '&', "'", '(', ')', '*', '+', ',', '-', '.', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', ':', ';', '?', '_', '`', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', '¦', '’', '“', '”']
 !"%&'()*+,-.0123456789:;?_`abcdefghijklmnopqrstuvwxyz¦’“”
corpus length: 11464282


In [0]:
from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktTrainer
 
trainer = PunktTrainer()
trainer.INCLUDE_ALL_COLLOCS = True
trainer.train(text)
 
tokenizer = PunktSentenceTokenizer(trainer.get_params())

In [48]:
# Test the tokenizer on a piece of text
sentences = "Mr. James told me Dr. Brown is not available today. I will try tomorrow."
 
print(tokenizer.tokenize(sentences))

['Mr.', 'James told me Dr.', 'Brown is not available today.', 'I will try tomorrow.']


In [0]:
import nltk

In [53]:
nltk.download()

True

In [0]:
from nltk import sent_tokenize
sentences = sent_tokenize(text)

In [55]:
sentences[101]

'"they have to rebuild .i also like everyday vanilla ice cream with the sides of the sandwich rolled in flaked coconut ( do this just after you fill the cookies so the ice cream is still soft enough for the flakes to adhere ) .the data showed a ratio of 2.9 birth defects per 1,000 live births in kettleman city during those years .still , the hornets , with hilton armstrong starting at center for chandler ( ankle ) , went toe-to-toe with the nuggets until denver \'s third-quarter run started the celebration .rouen , france , feb .a definitive destination for advertisements from football \'s biggest night.'

In [56]:
sentences[500]

'prudent and appropriate thing for chrysler to do to engage in the filings that they -- that received some notice a while back because they had to prepare for possible contingencies .mr fallon , 42 , two other riders and three other people were cleared after a key witness was undermined .under the normal rules of capitalism , any industry that can produce double-digit annual growth should soon be swamped by eager competitors until returns are driven down .the patient , villimin colleti , 71 , was taken to coney island hospital suffering from heart and brain damage , said the office of brooklyn district attorney charles j. hynes .some criminal always breaks the law and has a gun .the yankees star said the cousin told him it would give him a " dramatic energy boost " and repeatedly injected him from 2001-03 .liberal democrat mp evan harris says he has cross-party support for his measure to remove major discriminatory restrictions from the 1701 act of settlement , the independent reported

In [0]:
import numpy as np
np.random.seed(42)
import tensorflow as tf
tf.set_random_seed(42)
from keras.models import Sequential, load_model
from keras.layers import Dense, Activation, Dropout, LSTM, CuDNNLSTM
from keras.layers import TimeDistributed, RepeatVector
from keras.optimizers import RMSprop
import matplotlib.pyplot as plt
import pickle
import sys
import heapq
import seaborn as sns
from pylab import rcParams

%matplotlib inline

sns.set(style='whitegrid', palette='muted', font_scale=1.5)

rcParams['figure.figsize'] = 12, 5

In [85]:
#path = 'nietzsche.txt'

#path = "./drive/My Drive/ML/data/nietzsche.txt"

#path = "./drive/My Drive/ML/data/1-billion-word-language-modeling-benchmark-r13output/training-monolingual.tokenized.shuffled/news.en-00001-of-00100"

path = "./drive/My Drive/ML/data/word_pred.txt"

# Read the whole corpus and lowercase it; the file is closed on leaving the block
with open(path) as f:
    text = f.read().lower()
print('corpus length:', len(text))

corpus length: 11646654


In [86]:
chars = sorted(list(set(text)))
char_indices = dict((c, i) for i, c in enumerate(chars))
indices_char = dict((i, c) for i, c in enumerate(chars))

print(f'unique chars: {len(chars)}')

print(chars)

print(''.join(map(str, chars)))

unique chars: 71
['\n', ' ', '!', '"', '$', '%', '&', "'", '(', ')', '*', '+', ',', '-', '.', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', ':', ';', '<', '=', '?', '@', '[', ']', '_', '`', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', '¤', '¦', '©', '«', 'ã', 'ä', '’', '“', '”', '†']

 !"$%&'()*+,-.0123456789:;<=?@[]_`abcdefghijklmnopqrstuvwxyz¤¦©«ãä’“”†


In [0]:
def preprocess(data):
    punct = '\n#$<=>[\\]@^{|}~¡¢£¤¥©«¬®°²´µ¶·º»¼½¾¿×àáâãäåæçèéêëíîïñóôõöøùúüþąćĕěœšŵžʼ˚а‎‐‑‚‟†•′₤€∆④●♥ﬁ（）￡�'
    
    for p in punct:
        data = data.replace(p, '')
        
    return data
  
text = preprocess(text)

In [88]:
chars = sorted(list(set(text)))
char_indices = dict((c, i) for i, c in enumerate(chars))
indices_char = dict((i, c) for i, c in enumerate(chars))

print(f'unique chars: {len(chars)}')

print(chars)

print(''.join(map(str, chars)))

print('corpus length:', len(text))

unique chars: 58
[' ', '!', '"', '%', '&', "'", '(', ')', '*', '+', ',', '-', '.', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', ':', ';', '?', '_', '`', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', '¦', '’', '“', '”']
 !"%&'()*+,-.0123456789:;?_`abcdefghijklmnopqrstuvwxyz¦’“”
corpus length: 11464282


In [89]:
SEQUENCE_LENGTH = 40
step = 3
sentences = []
next_chars = []
for i in range(0, len(text) - SEQUENCE_LENGTH, step):
    sentences.append(text[i: i + SEQUENCE_LENGTH])
    next_chars.append(text[i + SEQUENCE_LENGTH])
print(f'num training examples: {len(sentences)}')

num training examples: 3821414
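
The windowing step can be checked on a toy string. This sketch wraps the same loop in a hypothetical helper, `make_windows`, and then recomputes the example count from the corpus length reported earlier (11464282 characters after preprocessing).

```python
def make_windows(text, seq_len, step):
    """Slide a window of `seq_len` characters over `text`, moving `step` at a time."""
    inputs, targets = [], []
    for i in range(0, len(text) - seq_len, step):
        inputs.append(text[i: i + seq_len])   # model input
        targets.append(text[i + seq_len])     # character to predict
    return inputs, targets

inputs, targets = make_windows('hello world', seq_len=4, step=3)
print(list(zip(inputs, targets)))
# [('hell', 'o'), ('lo w', 'o'), ('worl', 'd')]

# Number of windows for the real corpus: one per start index in the range.
print(len(range(0, 11464282 - 40, 3)))  # 3821414
```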


In [0]:
# Use the built-in bool: the np.bool alias is deprecated and removed in recent NumPy releases
X = np.zeros((len(sentences), SEQUENCE_LENGTH, len(chars)), dtype=bool)
y = np.zeros((len(sentences), len(chars)), dtype=bool)
for i, sentence in enumerate(sentences):
    for t, char in enumerate(sentence):
        X[i, t, char_indices[char]] = 1
    y[i, char_indices[next_chars[i]]] = 1
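
These dense one-hot arrays are large. A back-of-the-envelope estimate, using the shapes reported in this notebook (3821414 examples, 40 time steps, 58 characters) and assuming NumPy stores each boolean in one byte:

```python
import numpy as np

num_examples = 3821414
sequence_length = 40
num_chars = 58

bytes_per_value = np.dtype(bool).itemsize  # 1 byte per boolean in NumPy
x_bytes = num_examples * sequence_length * num_chars * bytes_per_value
y_bytes = num_examples * num_chars * bytes_per_value

print(f'X: {x_bytes / 1e9:.2f} GB, y: {y_bytes / 1e9:.2f} GB')
```

If that does not fit in the available memory, one option is to encode each batch on the fly in a generator instead of materialising `X` up front.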

In [0]:
model = Sequential()
#model.add(LSTM(128, input_shape=(SEQUENCE_LENGTH, len(chars))))

#model.add(CuDNNLSTM(128, input_shape=(None, len(chars))))

model.add(CuDNNLSTM(128, input_shape=(None, len(chars)), return_sequences=True))
#model.add(CuDNNLSTM(256, return_sequences=True))
model.add(CuDNNLSTM(256))

# Dropout added to reduce overfitting
model.add(Dropout(rate=0.2))

model.add(Dense(len(chars)))
model.add(Activation('softmax'))

# RMSprop initialised with the values recommended in the Keras documentation
optimizer = RMSprop(lr=0.001, rho=0.9, epsilon=1e-08, decay=0.0)

In [92]:
model.summary()

Model: "sequential_3"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
cu_dnnlstm_4 (CuDNNLSTM)     (None, None, 128)         96256     
_________________________________________________________________
cu_dnnlstm_5 (CuDNNLSTM)     (None, 256)               395264    
_________________________________________________________________
dropout_3 (Dropout)          (None, 256)               0         
_________________________________________________________________
dense_3 (Dense)              (None, 58)                14906     
_________________________________________________________________
activation_3 (Activation)    (None, 58)                0         
Total params: 506,426
Trainable params: 506,426
Non-trainable params: 0
_________________________________________________________________


In [93]:
#optimizer = RMSprop(lr=0.01)
model.compile(loss='categorical_crossentropy', optimizer=optimizer, metrics=['accuracy'])
history = model.fit(X, y, validation_split=0.05, batch_size=128, epochs=10, shuffle=True).history

Train on 3630343 samples, validate on 191071 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [0]:
model.save('./drive/My Drive/ML/Models/word_completion_prediction/R3/word_completion_prediction_keras_model.h5')
# Save the training history; the context manager flushes and closes the file
with open("./drive/My Drive/ML/Models/word_completion_prediction/R3/word_completion_prediction_history.p", "wb") as f:
    pickle.dump(history, f)

In [0]:
model = load_model('./drive/My Drive/ML/Models/word_completion_prediction/R3/word_completion_prediction_keras_model.h5')
# Load the saved training history; the context manager closes the file afterwards
with open("./drive/My Drive/ML/Models/word_completion_prediction/R3/word_completion_prediction_history.p", "rb") as f:
    history = pickle.load(f)

In [0]:
def prepare_input(text):
    x = np.zeros((1, len(text), len(chars)))
    for t, char in enumerate(text):
        x[0, t, char_indices[char]] = 1.
        
    return x

In [0]:
def sample(preds, top_n=3):
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds)
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    
    return heapq.nlargest(top_n, range(len(preds)), preds.take)

In [0]:
def predict_completions(text, n=3):
    x = prepare_input(text)
    preds = model.predict(x, verbose=0)[0]
    next_indices = sample(preds, n)
    return [indices_char[idx] + predict_completion(text[1:] + indices_char[idx]) for idx in next_indices]

In [0]:
# actual_text = [
#     "It is not a lack of love, but a lack of friendship that makes unhappy marriages.",
#     "That which does not kill us makes us stronger.",
#     "I'm not upset that you lied to me, I'm upset that from now on I can't believe you.",
#     "And those who were seen dancing were thought to be insane by those who could not hear the music.",
#     "It is hard enough to remember my opinions, without also remembering my reasons for them!",
#     "A man lying on a comfortable sofa is listening to his wi",
#     "Assuming the predictions are probabilistic, novel sequences can be generated from a trai",
#     "The networks performance is competitive with state-of-the-art language models, and it works almost",
#     "This document is the initial part of a study to predict next words from a text dataset"
# ]

# Named test_inputs so the built-in input() is not shadowed
test_inputs = [
    "It is not a lack of lov",
    "That which does not kill us makes us stro",
    "I'm not upset that you lied to me, I'm upset that from now on I can't bel",
    "And those who were seen dan",
    "It is hard enough to remember my opini",
    "A man lying on a comfortable ch",
    "Assuming the pre",
    "The networks performance is competi",
    "The networks performance is competitive with state-of-the-art lan",
    "This document is the initial part of a study to pre",
    "This document is the initial part of a study to pred",
    "Assuming the prediction",
    "Assuming the predictions are probabilistic, novel sequences can be gene",
    "Assuming the predictions are probabilistic, novel sequences can be generat"
]

In [102]:
for sentence in test_inputs:
    seq = sentence.lower()
    print(seq)
    print(predict_completions(seq, 5))
    print()

it is not a lack of lov
['ed ', 'ing ', ' .a ', 's ', 'anis ']

that which does not kill us makes us stro
['nger ', 'ller ', 'den ', 'te ', 'om ']

i'm not upset that you lied to me, i'm upset that from now on i can't bel
['ong ', 'ieve ', 'l ', 'ess ', 'rear ']

and those who were seen dan
['cing ', 'gerous ', 'd ', ' in ', 'ment ']

it is hard enough to remember my opini
['on ', 'ng ', 'ty ', 'st ', 'al ']

a man lying on a comfortable ch
['air ', 'eckered ', 'ild ', 'ristmas ', 'urch ']

assuming the pre
['sent ', 'paration ', 'cise ', 'asure ', 'tent ']

the networks performance is competi
['ng ', 'tion ', 's ', 'cing ', 'on ']

the networks performance is competitive with state-of-the-art lan
['ds ', 'gers ', 'ese ', 'ter ', 'chers ']

this document is the initial part of a study to pre
['sent ', 'pare ', 'tend ', 'ase ', 'cise ']

this document is the initial part of a study to pred
['uce ', 'ical ', 'om ', 'ent ', 'se ']

assuming the prediction
[' of ', 's ', 'a ', 'ing ', 'ed ']

assuming the predictions are probabilistic, novel sequences can be gene
['rally ', 'd ', 'ment ', 'nded ', 'ther ']

assuming the predictions are probabilistic, novel sequences can be generat
['ion ', 'ure ', 'ed ', 'ory ', 'ly ']

