**Loading the Data**

The first step is to mount Google Drive in our notebook so that we can access all of the directories and files in our Google Drive account.

In [1]:
from google.colab import drive
drive.mount('/content/drive', force_remount=False)

Mounted at /content/drive


In [2]:
pip install keras-preprocessing

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting keras-preprocessing
  Downloading Keras_Preprocessing-1.1.2-py2.py3-none-any.whl (42 kB)
Installing collected packages: keras-preprocessing
Successfully installed keras-preprocessing-1.1.2


Next, we import the libraries.

In [12]:
import matplotlib.pyplot as plt
import nltk
import tensorflow as tf

# keras module for building LSTM 
from keras_preprocessing.sequence import pad_sequences
from keras.layers import Embedding, LSTM, Dense, Dropout
from keras.preprocessing.text import Tokenizer
from keras.callbacks import EarlyStopping
from keras.models import Sequential
import keras.utils as ku 
from tensorflow import keras

import pandas as pd
import numpy as np
import string, os 

import warnings
warnings.filterwarnings("ignore")
warnings.simplefilter(action='ignore', category=FutureWarning)

**Get Lines From The Text**

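The next step is to read the text file line by line and append each line to a list called book. The last line of the cell then truncates the first entry, book[0], to the first thousandth of its characters. Splitting into sentences happens in the following step.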

In [4]:
book = []
with open('/content/drive/MyDrive/deep_learning/assignments/Cycle_GAN/Language_Model/internet_archive_scifi_v3.txt') as pdf:
  for line in pdf:
    book.append(line)
book[0] = book[0][:len(book[0]) // 1000]

We preprocess the text from the previous step by dropping punctuation, digits, and newline characters, splitting each line into sentences at the end-of-sentence marks (., !, ?), and converting all words to lower case. Lower-casing matters because a capitalized word and its lower-case form could otherwise be treated as two different tokens even though they are the same word.

In [5]:
import string

# Characters to drop entirely: ASCII punctuation plus digits
punctuations = string.punctuation
punctuations += '1234567890'
# Characters that mark the end of a sentence
eol = '.!?'

cleaned_book = []
for line in book:
  cleaned_line = ' '
  for char in line:
    if char in eol:
      # Close the current sentence, lower-case it, and start a new one
      cleaned_line += ' . '
      cleaned_line = cleaned_line.lower()
      cleaned_book.append(cleaned_line)
      cleaned_line = ' '
      continue
    if char in punctuations or char == '\n':
      continue
    cleaned_line += char
# cleaned_book.append(cleaned_line)
all_text = ' \n '.join(cleaned_book)
print(all_text[:2000])

 march  all stories new and complete publisher editor if is published bimonthly by quinn publishing company inc .  
   kingston new york .  
   volume  no .  
    .  
   copyright  by quinn publishing company inc .  
   application for entry as second class matter at post office buffalo new york pending .  
   subscription  for  issues in u .  
  s .  
   and possessions canada  for  issues elsewhere  .  
   aiiow four weeks for change of address .  
   all stories appearing in this magazine are fiction .  
   any similarity to actual persons is coincidental .  
   c a fcopy .  
   printed ia u .  
  s .  
   a .  
   a chat with the editor  i   science fiction magazine called if .  
   the title was selected after much thought because of its brevity and on the theory it is indicative of the field and will be easy to remember .  
   the tentative title that just morning and couldnt remember it until wed had a cup of coffee it was summarily discarded .  
   a great deal of thought and e

**Dataset Cleaning**

The next step is to clean the dataset further by removing any remaining punctuation and all non-ASCII characters, which gives the N-gram step cleaner word sequences to work with.

In [6]:
def clean_text(txt):
    txt = "".join(v for v in txt if v not in string.punctuation).lower()
    txt = txt.encode("utf8").decode("ascii",'ignore')
    return txt 

corpus = [clean_text(x) for x in cleaned_book]
corpus[:10]

[' march  all stories new and complete publisher editor if is published bimonthly by quinn publishing company inc  ',
 '  kingston new york  ',
 '  volume  no  ',
 '    ',
 '  copyright  by quinn publishing company inc  ',
 '  application for entry as second class matter at post office buffalo new york pending  ',
 '  subscription  for  issues in u  ',
 ' s  ',
 '  and possessions canada  for  issues elsewhere   ',
 '  aiiow four weeks for change of address  ']
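To see what clean_text does to characters outside the ASCII range, here is a minimal sketch on a made-up sample string (the sample is our own and is not taken from the dataset). Note that non-ASCII characters are dropped, not transliterated, so accented words lose letters.

```python
import string

def clean_text(txt):
    # Same logic as the notebook cell: strip ASCII punctuation, lower-case,
    # then silently drop every character that cannot be encoded as ASCII.
    txt = "".join(v for v in txt if v not in string.punctuation).lower()
    txt = txt.encode("utf8").decode("ascii", "ignore")
    return txt

# Hypothetical sample with accented letters (written as escapes: é, é, à)
sample = "Caf\u00e9 D\u00e9j\u00e0-vu!"
cleaned_sample = clean_text(sample)
print(repr(cleaned_sample))
# 'caf djvu'
```

If keeping the base letters matters, a transliteration step would be needed before this function; the notebook does not do that.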

**Generating Sequence of N-gram Tokens**

The next step is to tokenize the sentences, that is, to convert them into integers that our algorithm can process. We then build N-gram sequences so that the model can learn which word follows the current words.

To predict the next word/token, we need sequences of words/tokens as input data, which will be fed into our language model (LSTM). Tokenization is the process of extracting tokens (terms/words) from a corpus. After this step, every text document in the dataset is converted into a sequence of tokens.

In [7]:
tokenizer = Tokenizer()

def get_sequence_of_tokens(corpus):
    ## tokenization
    tokenizer.fit_on_texts(corpus)
    total_words = len(tokenizer.word_index) + 1
    
    ## convert data to sequence of tokens 
    input_sequences = []
    for line in corpus:
        token_list = tokenizer.texts_to_sequences([line])[0]
        for i in range(1, len(token_list)):
            n_gram_sequence = token_list[:i+1]
            input_sequences.append(n_gram_sequence)
    return input_sequences, total_words

inp_sequences, total_words = get_sequence_of_tokens(corpus)
inp_sequences[:10]

[[1970, 41],
 [1970, 41, 421],
 [1970, 41, 421, 228],
 [1970, 41, 421, 228, 5],
 [1970, 41, 421, 228, 5, 771],
 [1970, 41, 421, 228, 5, 771, 1971],
 [1970, 41, 421, 228, 5, 771, 1971, 772],
 [1970, 41, 421, 228, 5, 771, 1971, 772, 57],
 [1970, 41, 421, 228, 5, 771, 1971, 772, 57, 37],
 [1970, 41, 421, 228, 5, 771, 1971, 772, 57, 37, 1972]]

Based on the result above, every integer corresponds to the index of a particular word in the complete vocabulary of words present in the text.
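To make the prefix expansion concrete, here is a minimal sketch on a toy sentence. The word_index mapping and the prefix_sequences helper are hypothetical and hand-built for illustration (the real mapping comes from the Keras Tokenizer); only the slicing logic mirrors get_sequence_of_tokens.

```python
# Hypothetical hand-built vocabulary; the notebook's real mapping comes from Tokenizer.fit_on_texts.
word_index = {"the": 1, "ship": 2, "left": 3, "earth": 4}

def prefix_sequences(sentence, word_index):
    """Encode a sentence and return every prefix of length >= 2."""
    token_list = [word_index[w] for w in sentence.split() if w in word_index]
    return [token_list[: i + 1] for i in range(1, len(token_list))]

toy_sequences = prefix_sequences("the ship left earth", word_index)
print(toy_sequences)
# [[1, 2], [1, 2, 3], [1, 2, 3, 4]]
```

A sentence of n tokens therefore contributes n - 1 training sequences, and a one-word sentence contributes none.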

**Padding the Sequences and Obtaining Variables: Predictors and Target**

In this step, we generate padded sequences so that we can obtain our predictors, labels, and the maximum sequence length. Different sequences may have different lengths, so we pad them to a common length because the model expects fixed-size input. To feed this data into a learning model, we need predictors and a label: for each N-gram sequence, all tokens except the last serve as the predictors and the last token is the label.

In [8]:
def generate_padded_sequences(input_sequences):
    max_sequence_len = max([len(x) for x in input_sequences])
    input_sequences = np.array(pad_sequences(input_sequences, maxlen=max_sequence_len, padding='pre'))
    
    predictors, label = input_sequences[:,:-1],input_sequences[:,-1]
    label = ku.to_categorical(label, num_classes=total_words)
    return predictors, label, max_sequence_len

predictors, label, max_sequence_len = generate_padded_sequences(inp_sequences)

Now we have the input vector X (predictors) and the label vector Y (label). We will use both for training.
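The padding and the predictor/label split can be traced by hand on a tiny example. The sketch below is written in plain Python with our own helper names (pad_pre and split_predictors_label are hypothetical and stand in for pad_sequences and to_categorical), so it only illustrates the idea and is not the Keras implementation.

```python
def pad_pre(sequences, pad_value=0):
    """Left-pad every sequence with pad_value up to the longest length."""
    max_len = max(len(s) for s in sequences)
    padded = [[pad_value] * (max_len - len(s)) + list(s) for s in sequences]
    return padded, max_len

def split_predictors_label(padded, num_classes):
    """The last token becomes a one-hot label; the rest are the predictors."""
    predictors = [row[:-1] for row in padded]
    labels = []
    for row in padded:
        one_hot = [0] * num_classes
        one_hot[row[-1]] = 1
        labels.append(one_hot)
    return predictors, labels

padded, max_len = pad_pre([[1, 2], [1, 2, 3], [1, 2, 3, 4]])
toy_predictors, toy_labels = split_predictors_label(padded, num_classes=5)
print(padded)          # [[0, 0, 1, 2], [0, 1, 2, 3], [1, 2, 3, 4]]
print(toy_predictors)  # [[0, 0, 1], [0, 1, 2], [1, 2, 3]]
print(toy_labels[0])   # [0, 0, 1, 0, 0]
```

Padding on the left keeps the most recent words next to the label, which is why the notebook passes padding='pre'.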

**LSTM Model**

We use an LSTM, which is a kind of Recurrent Neural Network (RNN). In an RNN, the activation outputs of the neurons propagate not only forward from inputs to outputs but also back into the network as input to the next time step. This recurrence acts as a memory state for the neurons in our architecture, so the network can still remember what it has seen in previous steps.

LSTMs mitigate the vanishing gradient problem of plain RNNs. Therefore, we chose an LSTM to build our text generator, since it makes it easier for the network to learn from the earlier parts of a sequence. The cell state in an LSTM helps the network regulate the flow of information, so the model can remember or forget what it has learned more selectively. We will train this model for a total of 100 epochs, but this can be experimented with further.

**Generating Text**

Once the model is trained, we tokenize the seed text, pad the sequence, and pass it to the model to get the predicted next word. Repeating this and appending each predicted word yields the generated sequence. Before that, we build and train the model.

In [9]:
def create_model(max_sequence_len, total_words):
    input_len = max_sequence_len - 1
    model = Sequential()
    
    # Add Input Embedding Layer
    model.add(Embedding(total_words, 10, input_length=input_len))
    
    # Add Hidden Layer 1 - LSTM Layer
    model.add(LSTM(100))
    model.add(Dropout(0.2))
    
    # Add Output Layer
    model.add(Dense(total_words, activation='softmax'))

    model.compile(loss='categorical_crossentropy', optimizer='adam')
    
    return model

model = create_model(max_sequence_len, total_words)
model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 83, 10)            47310     
                                                                 
 lstm (LSTM)                 (None, 100)               44400     
                                                                 
 dropout (Dropout)           (None, 100)               0         
                                                                 
 dense (Dense)               (None, 4731)              477831    
                                                                 
Total params: 569,541
Trainable params: 569,541
Non-trainable params: 0
_________________________________________________________________
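The parameter counts in the summary can be checked by hand. The sketch below uses the standard formulas for embedding, LSTM, and dense layers; the helper functions are our own and are not part of Keras, and the sizes (vocabulary 4731, embedding dimension 10, 100 LSTM units) are read off the model definition and summary above.

```python
def embedding_params(vocab_size, embed_dim):
    # One embed_dim-sized vector per vocabulary entry
    return vocab_size * embed_dim

def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights, and a bias vector
    return 4 * ((input_dim + units) * units + units)

def dense_params(input_dim, units):
    # Weight matrix plus one bias per output unit
    return input_dim * units + units

vocab_size, embed_dim, lstm_units = 4731, 10, 100
param_counts = [
    embedding_params(vocab_size, embed_dim),
    lstm_params(embed_dim, lstm_units),
    dense_params(lstm_units, vocab_size),
]
print(param_counts, sum(param_counts))
# [47310, 44400, 477831] 569541
```

Most of the parameters sit in the output layer, because it has one unit per vocabulary word.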


In [13]:
checkpoint_filepath = "/content/drive/MyDrive/deep_learning/assignments/Cycle_GAN/Language_Model/model_checkpoints/language_model_checkpoints.{epoch:03d}"
model_checkpoint_callback = keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_filepath
)

model.fit(predictors, label, epochs=100, verbose=5, callbacks=[model_checkpoint_callback])

Epoch 1/100
Epoch 2/100
...
Epoch 100/100

<keras.callbacks.History at 0x7f444b0443d0>

**Generating the text**

In [14]:
def generate_text(seed_text, next_words, model, max_sequence_len):
    for _ in range(next_words):
        # Encode the running text and left-pad it to the model's input length
        token_list = tokenizer.texts_to_sequences([seed_text])[0]
        token_list = pad_sequences([token_list], maxlen=max_sequence_len-1, padding='pre')
        # model.predict_classes is not available here, so take the argmax of the softmax output
        predict_x = model.predict(token_list, verbose=0)
        predicted = int(np.argmax(predict_x, axis=1)[0])

        # Map the predicted index back to its word (index 0 is padding and has no word)
        output_word = tokenizer.index_word.get(predicted, "")
        seed_text += " " + output_word
    return seed_text.title()

In [15]:
# Load the checkpoints
weight_file = "/content/drive/MyDrive/deep_learning/assignments/Cycle_GAN/Language_Model/model_checkpoints/language_model_checkpoints.070"
model.load_weights(weight_file).expect_partial()
print("Weights loaded successfully")

Weights loaded successfully
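The .070 suffix of the weight file comes from the {epoch:03d} placeholder in checkpoint_filepath. Our understanding is that ModelCheckpoint fills this placeholder with the 1-based epoch number using ordinary Python string formatting, so the file loaded above should correspond to the end of epoch 70. The sketch below shows only the formatting step, with the directory part of the path left out.

```python
# Same placeholder as in checkpoint_filepath, without the Drive directory prefix
checkpoint_template = "language_model_checkpoints.{epoch:03d}"

# Zero-padded to three digits, so the files sort in training order
checkpoint_names = [checkpoint_template.format(epoch=e) for e in (1, 70, 100)]
print(checkpoint_names)
# ['language_model_checkpoints.001', 'language_model_checkpoints.070', 'language_model_checkpoints.100']
```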


In [16]:
print(generate_text("climbing up the mountain", 300, model, max_sequence_len))

Climbing Up The Mountain Dough His Feet Off His Mouth And Restored To The Floor And The Man Was Called Down To The Last Key And The Man In The Terran S Would Be No Part To The Globe Away And The Man Shed T And The Man Was Called Down To The Floor And The Man Was Day And Left It Sounds At The Whole Named Miller The Only Larwich Was The Man Of The Seated Figure Of The Spotlight Wall Of The Beach Before The Of Itself Were Out From The Cone Of The Long Jaw Of The Various Controls Lifted To Make Him A Basic Paralysis That From Him In The Room And Strolled Out Into The Narrow Space Sum Of The Gleaming Vessel And The Globe Became The Lieutenants Hair The Lieutenants Hair The Bar And Sound Was A Short Human And Had A Silver Colored Asphyxiation Plea Out In The Light Could Been Been A Thin Of The Floor Of The Beach Before The Bolt Lay Hundreds Of Thousands Of The Floor With The Beds Footboard And The Restless Creak Of Barren Branches Kept It From Becoming Unbearable A Kind Of Town With The Oxyg

In [17]:
print(generate_text("What if we rewrite the stars", 300, model, max_sequence_len))

What If We Rewrite The Stars Writers In The Field He Could Have Have A Trap Which Would Be Light To The Floor Of The Left Of The Floor As The Globe Became A Blue Mist That Spiralled Lazily Was Out Of The Carpeted Treads Of The Carpet Side Of The Rounded Wall Of The Spinsters Nightstand And The Restless Creak Of Shoes Dropped The Ram And The Footboard Of Acceleration And Skimmed Along The Lines Panel Was Than A Couple Of Questions That Drew No Way To Her Deaths Of Kirks Mouth And Will Be A Different Across The Gun Even Even A Deep Across The Floor And The Restless Creak Of Barren Branches Kept His Ears Staring Was Began To Klia Was A Different Paper On The Cover Of The Seated Figure Of The Spinsters Nightstand And The Globe Of The Brain Of The Floor And The Floor From The Floor Of The Long Lab Came Out Plunging The Side Of The Floor Of The Spinsters Bole And The Restless Creak Of Shoes Dropped The Of Tube The Buttons And The Man Became Completed It Was Much Before He Didnt Drunk A Nine 

In [19]:
print(generate_text("Walking on the shore", 100, model, max_sequence_len))

Walking On The Shore Building The Ones Of The Floor Of The Cubbyhole The Wall Of His Floor And The Restless Glass Commission Panel Began Into The Spinsters Nightstand And The Globe Was Those At The Right Suspended As The Of Itself The Globe And The Globe Became The North In The Sound Of The Opposite Tube The Night Lay Tobelongings And The Restless Means Of Barren Branches Kept His Physical Appearance And The Globe At The Opposite Even The Bar And Ing A Short Bomb From The Carpet Before The Lean Top From The Floor Of The Battered Tube Of The Spinsters Nightstand
