<a href="https://colab.research.google.com/github/Ryanh8/NextWordPredictor/blob/main/Next_Word_Prediction_using_Universal_Sentence_Encoder.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Use Google's Pre-trained Universal Sentence Encoder to Train an NLP Model


# Build the Model

In [3]:
# Import the required libraries
# Keras classes are imported through tensorflow.keras so they match the installed TensorFlow version

import os
import re
import gdown
import numpy
import string
import numpy as np
import pandas as pd
import seaborn as sns
import tensorflow as tf
from absl import logging
import tensorflow_hub as hub
from tensorflow import keras
import matplotlib.pyplot as plt
import tensorflow.keras.backend as K
from tensorflow.keras.models import Sequential
from tensorflow.keras.utils import get_file
from tensorflow.keras.callbacks import LambdaCallback
from tensorflow.keras.layers import Dense, Activation, LSTM, Embedding
from sklearn.model_selection import train_test_split

## **Data preparation - _Generating Corpus_**

In [4]:
!wget https://raw.githubusercontent.com/maxim5/stanford-tensorflow-tutorials/master/data/arxiv_abstracts.txt -O corpus.txt

--2021-02-06 16:18:55--  https://raw.githubusercontent.com/maxim5/stanford-tensorflow-tutorials/master/data/arxiv_abstracts.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 7540200 (7.2M) [text/plain]
Saving to: ‘corpus.txt’


2021-02-06 16:18:55 (112 MB/s) - ‘corpus.txt’ saved [7540200/7540200]



In [8]:
# Read local file from directory
with open('corpus.txt') as subject:
  cache = subject.readlines()
translator = str.maketrans('', '', string.punctuation) # Remove punctuation
lines = [doc.lower().translate(translator) for doc in cache] # Switch to lower case
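
The cleaning step above also explains vocabulary entries such as `finegrained`: deleting punctuation joins hyphenated words. A minimal sketch on a made-up sentence (the `clean` helper and the sample text are illustrative, not part of the notebook):

```python
import string

# Translation table that deletes every ASCII punctuation character
translator = str.maketrans('', '', string.punctuation)

def clean(text):
    """Lower-case the text and strip ASCII punctuation."""
    return text.lower().translate(translator)

sample = "Fine-grained, state-of-the-art models (e.g. CNNs) work well!"
cleaned = clean(sample)
print(cleaned)  # finegrained stateoftheart models eg cnns work well
```

If hyphenated terms should stay as separate words, one option would be to replace hyphens with spaces before applying the translation table.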

In [7]:
# PREVIEW OUTPUT ::

print(lines[0][:100])
len(lines)

in science and engineering intelligent processing of complex signals such as images sound or languag


7200

In [9]:
# Generate a list of unique words and map each word to its index

vocabulary = list(set(' '.join(lines).replace('\n','').split(' ')))
primary_store = {}
for index, word in enumerate(vocabulary):
  primary_store[word] = index
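
The word-to-index mapping can be checked on a toy corpus. The sketch below assumes two made-up lines and sorts the vocabulary so that indices are reproducible; the notebook's `list(set(...))` order can differ between Python sessions, which is why the vocabulary is saved alongside the model later on. The name `word_to_index` is illustrative.

```python
# Two made-up lines standing in for the cleaned corpus
lines = ["deep networks learn\n", "deep models learn features\n"]

# Sorting gives every word a stable index across runs
vocabulary = sorted(set(' '.join(lines).replace('\n', '').split(' ')))
word_to_index = {word: index for index, word in enumerate(vocabulary)}

print(vocabulary)     # ['deep', 'features', 'learn', 'models', 'networks']
print(word_to_index)  # {'deep': 0, 'features': 1, 'learn': 2, 'models': 3, 'networks': 4}
```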

In [10]:
# PREVIEW OUTPUT ::

print(vocabulary[:50])
len(vocabulary)
print(primary_store)

['', 'handwriting', 'em', 'allowing', 'computationally', 'renewed', 'obtained', 'opportunities', 'sometimes', 'phase', 'spark', 'big', 'about', 'generates', 'replaces', 'subnetworks', 'mediated', 'hmm', 'worlds', 'heuristic', 'suboptimal', 'yielded', 'systems', 'explicitly', 'irregular', 'thought', 'demonstrate', 'terms', 'rprops', 'treatment', 'university', 'twodimensions', 'sizes', 'rigorous', 'prove', 'patterns', 'rare', 'these', 'tested', 'phoneme', 'transitions', 'strided', 'chosen', 'undirected', 'learned', 'finegrained', 'confidentinformationfirst', 'complex', 'arbitrary', 'coding']
{'': 0, 'handwriting': 1, 'em': 2, 'allowing': 3, 'computationally': 4, 'renewed': 5, 'obtained': 6, 'opportunities': 7, 'sometimes': 8, 'phase': 9, 'spark': 10, 'big': 11, 'about': 12, 'generates': 13, 'replaces': 14, 'subnetworks': 15, 'mediated': 16, 'hmm': 17, 'worlds': 18, 'heuristic': 19, 'suboptimal': 20, 'yielded': 21, 'systems': 22, 'explicitly': 23, 'irregular': 24, 'thought': 25, 'demonstr

In [11]:
# Build (context, next word) pairs, then split them into training and test sets

X = []
y = []

for line in lines:
  words = line.replace('\n','').split(' ')
  X.append(' '.join(words[:-1])) # Input: every word except the last

  one_hot = [0 for i in range(len(vocabulary))] # Target: one-hot vector over the vocabulary
  one_hot[primary_store[words[-1]]] = 1 # Mark the position of the last word
  y.append(one_hot)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
y_test = np.array(y_test)
y_train = np.array(y_train)
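
To make the input/target construction concrete, here is a small sketch on the same kind of toy data. It stores each target as an integer index first and builds the one-hot matrix with `np.eye`; the corpus, the vocabulary, and the names `contexts`, `targets`, and `word_to_index` are assumptions for illustration only.

```python
import numpy as np

vocabulary = ['deep', 'features', 'learn', 'models', 'networks']
word_to_index = {word: index for index, word in enumerate(vocabulary)}
lines = ["deep networks learn\n", "deep models learn features\n"]

contexts, targets = [], []
for line in lines:
    words = line.replace('\n', '').split(' ')
    contexts.append(' '.join(words[:-1]))     # every word except the last
    targets.append(word_to_index[words[-1]])  # index of the last word

# Row i of the identity matrix is the one-hot vector for index i
one_hot = np.eye(len(vocabulary), dtype=np.int64)[targets]

print(contexts)  # ['deep networks', 'deep models learn']
print(targets)   # [2, 1]
print(one_hot.shape)  # (2, 5)
```

Integer targets like `targets` could also be fed to a model compiled with `sparse_categorical_crossentropy`, which avoids materializing a samples-by-vocabulary matrix; the notebook itself uses one-hot targets with `categorical_crossentropy`.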

In [12]:
# PREVIEW OUTPUT ::

print(X_train[:10])
print(y_train[:10])
print(X_test[:10])
print(y_test[:10])

['in this paper we present an infinite hierarchical nonparametric bayesian model to extract the hidden factors over observed data where the number of hidden factors for each layer is unknown and can be potentially infinite moreover the number of layers can also be infinite we construct the model structure that allows continuous values for the hidden factors and weights which makes the model suitable for various applications we use the metropolishastings method to infer the model structure then the performance of the algorithm is evaluated by the experiments simulation results show that the model fits the underlying structure of simulated', 'we study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have deep networks are able to sequentially map portions of each layers inputspace to the same output in this way deep models compute functions that react equally 

## **Embeddings!**

In [13]:
# Load the Universal Sentence Encoder (version 4) from TensorFlow Hub
# The first call downloads the model, so it can take a little while

module_url = "https://tfhub.dev/google/universal-sentence-encoder/4"  
appreciate = hub.load(module_url)


In [None]:
# PREVIEW OUTPUT ::

appreciate.variables

In [None]:
# Encode the training and test sentences with the Universal Sentence Encoder (one 512-dimensional vector per sentence)

X_train = appreciate(X_train)
X_test = appreciate(X_test)
X_train = X_train.numpy()
X_test = X_test.numpy()
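
The cell above encodes each split in a single call. If that runs out of memory on a smaller machine, one possible workaround is to encode in chunks. The sketch below is an assumption rather than part of the notebook: `encode_in_batches` is a hypothetical helper, and `fake_encoder` is a stand-in so the example runs without TensorFlow Hub. In the notebook, the loaded encoder (`appreciate`) would be passed instead.

```python
import numpy as np

def encode_in_batches(encoder, sentences, batch_size=256):
    """Encode sentences in chunks and stack the results into one array."""
    chunks = []
    for start in range(0, len(sentences), batch_size):
        batch = sentences[start:start + batch_size]
        # np.asarray accepts NumPy arrays as well as TensorFlow eager tensors
        chunks.append(np.asarray(encoder(batch)))
    return np.concatenate(chunks, axis=0)

# Hypothetical stand-in that mimics the encoder's (batch, 512) output shape
def fake_encoder(batch):
    return np.ones((len(batch), 512), dtype=np.float32)

embeddings = encode_in_batches(fake_encoder, ['a b', 'c d', 'e f'], batch_size=2)
print(embeddings.shape)  # (3, 512)
```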

In [None]:
# PREVIEW OUTPUT ::

print(X_train[:10])
print(y_train[:10])
print(X_test[:10])
print(y_test[:10])
print(X_train.shape, X_test.shape, y_test.shape, y_train.shape)

## **Building the model**

In [None]:
model = Sequential()
model.add(Dense(512, input_shape=[512], activation = 'relu'))
model.add(Dense(units=len(vocabulary), activation = 'softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['acc'])
model.summary()
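
The parameter counts reported by `model.summary()` can be checked by hand: a `Dense` layer has `inputs * units` weights plus `units` biases. The sketch below does that arithmetic; the vocabulary size of 10000 is a hypothetical value, not the size of the notebook's vocabulary.

```python
def dense_params(inputs, units):
    """Weights plus biases of a fully connected layer."""
    return inputs * units + units

def model_params(vocab_size, embedding_dim=512, hidden_units=512):
    """Total parameters of the two-layer model for a given vocabulary size."""
    hidden = dense_params(embedding_dim, hidden_units)
    output = dense_params(hidden_units, vocab_size)
    return hidden + output

print(dense_params(512, 512))  # 262656
print(model_params(10000))     # 5392656 for a hypothetical 10000-word vocabulary
```

Because the output layer grows linearly with the vocabulary, it accounts for most of the parameters in this architecture.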

## Train the model

In [None]:
model.fit(X_train, y_train, batch_size=512, shuffle=True, epochs=100, validation_data=(X_test, y_test), callbacks=[LambdaCallback()])

# Save the Model

In [None]:
vocabulary = numpy.array(vocabulary)
numpy.save('vocabulary.npy', vocabulary)
model.save('arxiv_abstract_model')

# Validate the Model

## Restore the saved model

In [None]:
pre_trained_model = tf.keras.models.load_model('arxiv_abstract_model')
pre_trained_model.summary()

vocabulary = np.load('vocabulary.npy')
print(len(vocabulary))
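
The vocabulary has to be restored together with the model because each output unit is tied to a word's position in that array. A minimal round-trip sketch, using a toy vocabulary and a temporary directory (both assumptions for illustration):

```python
import os
import tempfile
import numpy as np

vocabulary = np.array(['deep', 'features', 'learn', 'models', 'networks'])

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, 'vocabulary.npy')
    np.save(path, vocabulary)  # write the array in NumPy's .npy format
    restored = np.load(path)   # read it back with order and dtype intact

print(restored[3])  # models
```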

## Start the demo

In [None]:
# Predict the next word(s) for each phrase and print every step

def next_word(model, collection=None, extent=1):
  """Append `extent` predicted words to each phrase in `collection`."""
  for text in collection or []:
    item = text
    for _ in range(extent):
      # Embed the current phrase and get one probability per vocabulary word
      prediction = model.predict(x=appreciate([item]).numpy())
      idx = np.argmax(prediction[0])
      item += ' ' + vocabulary[idx]

      print(text + ' --> ' + item + '\nNEXT WORD: ' + item.split(' ')[-1] + '\n')
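
The function above keeps only the single most probable word. Showing a few alternatives can be handy when inspecting the model; the sketch below ranks candidates from one probability vector. `top_k_words`, the toy vocabulary, and the probabilities are hypothetical; in the notebook the vector would be one row of `model.predict(...)`.

```python
import numpy as np

def top_k_words(probabilities, vocabulary, k=3):
    """Return the k most probable (word, probability) pairs, best first."""
    probabilities = np.asarray(probabilities)
    best = np.argsort(probabilities)[::-1][:k]  # indices sorted by descending probability
    return [(str(vocabulary[i]), float(probabilities[i])) for i in best]

vocabulary = np.array(['deep', 'features', 'learn', 'models', 'networks'])
probabilities = [0.05, 0.10, 0.15, 0.30, 0.40]
ranked = top_k_words(probabilities, vocabulary, k=2)
print(ranked)  # [('networks', 0.4), ('models', 0.3)]
```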

In [None]:
# Test on a collection of phrases

text_collection = ['this article improve', 'deep adversarial', 'a nonconvex', 'parallel']

next_word(pre_trained_model, text_collection)