# Data 622
## Assignment 8 - LSTM word prediction
Mark Ly

Student ID: 00504696

In [7]:
import pandas as pd
import matplotlib.pyplot as plt
import tensorflow as tf
from tensorflow.keras.callbacks import ModelCheckpoint
from tensorflow import keras
from tensorflow.keras import layers
print(tf.config.list_physical_devices('GPU'))
import numpy as np
import random
import io

!pip install wikipedia
import wikipedia

[]


## Loading text

Using the wikipedia package, we load the Artificial Intelligence page and convert it to lower case, which reduces the vocabulary the network has to learn. The resulting corpus is 83,247 characters long.

In [8]:
raw_text = wikipedia.page("Artificial Intelligence").content.lower()
print("Corpus length:", len(raw_text))

Corpus length: 83247


In [9]:
raw_text

'artificial intelligence (ai) is intelligence demonstrated by machines, as opposed to natural intelligence displayed by animals including humans. leading ai textbooks define the field as the study of "intelligent agents": any system that perceives its environment and takes actions that maximize its chance of achieving its goals.some popular accounts use the term "artificial intelligence" to describe machines that mimic "cognitive" functions that humans associate with the human mind, such as "learning" and "problem solving", however, this definition is rejected by major ai researchers.ai applications include advanced web search engines (e.g., google), recommendation systems (used by youtube, amazon and netflix), understanding human speech (such as siri and alexa), self-driving cars (e.g., tesla), automated decision-making and competing at the highest level in strategic game systems (such as chess and go).\nas machines become increasingly capable, tasks considered to require "intelligenc

## Distinct Characters
We need a dictionary that maps each distinct character to a unique integer, plus the reverse mapping so that predicted integers can be turned back into characters. The text contains 66 unique characters.

In [10]:
chars = sorted(list(set(raw_text)))
char_indices = dict((c, i) for i, c in enumerate(chars))
indices_char = dict((i, c) for i, c in enumerate(chars))

In [11]:
n_vocab = len(chars)
print("Total Vocab: ", n_vocab)

Total Characters:  83247
Total Vocab:  66
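
To make the two lookup tables concrete, here is a minimal sketch of the same construction on a toy string. The names `toy_text`, `toy_chars`, `toy_char_indices` and `toy_indices_char` are illustrative only and are not part of the notebook.

```python
# Toy text standing in for raw_text (an assumption for illustration only).
toy_text = "hello world"

# Same construction as in the cell above: sorted unique characters,
# then a character -> index table and its inverse.
toy_chars = sorted(set(toy_text))
toy_char_indices = {c: i for i, c in enumerate(toy_chars)}
toy_indices_char = {i: c for i, c in enumerate(toy_chars)}

# Encode the text as integers, then decode it back.
encoded = [toy_char_indices[c] for c in toy_text]
decoded = "".join(toy_indices_char[i] for i in encoded)

print(toy_chars)   # the 8 distinct characters, space first
print(encoded)
print(decoded)     # round-trips to the original string
```

Because the two dictionaries are exact inverses, encoding followed by decoding returns the original text unchanged.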


## Training/Testing data
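Next we cut the corpus into overlapping training sequences. Each input is a 100-character window, and its target is the single character that immediately follows the window. Sliding the window forward 3 characters at a time gives 27,716 sequences, all of which are used for training. Both inputs and targets are then one-hot encoded.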


In [12]:
# cut the text in semi-redundant sequences of maxlen characters
maxlen = 100
step = 3
sentences = []
next_chars = []
for i in range(0, len(raw_text) - maxlen, step):
    sentences.append(raw_text[i : i + maxlen])
    next_chars.append(raw_text[i + maxlen])
print("Number of sequences:", len(sentences))

# one-hot encode inputs and targets (np.bool is deprecated since NumPy 1.20; use the builtin bool)
x = np.zeros((len(sentences), maxlen, len(chars)), dtype=bool)
y = np.zeros((len(sentences), len(chars)), dtype=bool)
for i, sentence in enumerate(sentences):
    for t, char in enumerate(sentence):
        x[i, t, char_indices[char]] = 1
    y[i, char_indices[next_chars[i]]] = 1

print('y: ', y)
print('x: ', x[:,0])

Number of sequences: 27716




y:  [[False False False ... False False False]
 [False False False ... False False False]
 [False False False ... False False False]
 ...
 [False  True False ... False False False]
 [False  True False ... False False False]
 [False False False ... False False False]]
x:  [[False False False ... False False False]
 [False False False ... False False False]
 [False False False ... False False False]
 ...
 [False False False ... False False False]
 [False False False ... False False False]
 [False False False ... False False False]]
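
The windowing and one-hot encoding can be checked on a small example. The sketch below repeats the same loop on a short toy string with a shorter window (`toy_maxlen = 5`); all `toy_*` names are hypothetical. It also recomputes the sequence count reported for the real corpus.

```python
import numpy as np

toy_text = "the quick fox"   # hypothetical stand-in for raw_text
toy_maxlen = 5               # window length (the notebook uses 100)
toy_step = 3                 # stride between windows (same as the notebook)

toy_chars = sorted(set(toy_text))
toy_char_indices = {c: i for i, c in enumerate(toy_chars)}

# Slide a window over the text; the target is the character right after it.
windows, targets = [], []
for i in range(0, len(toy_text) - toy_maxlen, toy_step):
    windows.append(toy_text[i : i + toy_maxlen])
    targets.append(toy_text[i + toy_maxlen])

# One-hot encode: x is (sequences, timesteps, vocab), y is (sequences, vocab).
x_toy = np.zeros((len(windows), toy_maxlen, len(toy_chars)), dtype=bool)
y_toy = np.zeros((len(windows), len(toy_chars)), dtype=bool)
for i, window in enumerate(windows):
    for t, char in enumerate(window):
        x_toy[i, t, toy_char_indices[char]] = True
    y_toy[i, toy_char_indices[targets[i]]] = True

print(windows, targets)
print(x_toy.shape, y_toy.shape)

# Sequence count for the real corpus: 83,247 characters, maxlen 100, step 3.
n_sequences = len(range(0, 83247 - 100, 3))
print(n_sequences)
```

Every timestep in `x_toy` has exactly one `True` entry, which is what the LSTM input layer of shape `(maxlen, len(chars))` expects.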


## Model Architecture  
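The model stacks two hidden LSTM layers of 128 memory units each, with a dropout layer (rate 0.2, i.e. 20%) after each one. The output layer is a dense layer with softmax activation over the 66 characters. The model is compiled with categorical cross-entropy loss and the RMSprop optimizer (learning rate 0.01).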

In [13]:
model = keras.Sequential(
    [
        keras.Input(shape=(maxlen, len(chars),)),
        layers.LSTM(128,return_sequences=True),
        layers.Dropout(0.2),
        layers.LSTM(128),
        layers.Dropout(0.2),
        layers.Dense(len(chars), activation="softmax"),
    ]
)
optimizer = keras.optimizers.RMSprop(learning_rate=0.01)
model.compile(loss="categorical_crossentropy", optimizer=optimizer)
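
As a rough size check on this architecture, the trainable parameter count can be worked out by hand. The sketch below assumes the standard Keras LSTM parameterization (four gates, each with an input kernel, a recurrent kernel and one bias vector); the helper names `lstm_params` and `dense_params` are hypothetical. If in doubt, `model.summary()` is the authoritative source.

```python
def lstm_params(input_dim, units):
    # Four gates, each with an input kernel (input_dim x units),
    # a recurrent kernel (units x units) and a bias (units).
    return 4 * (units * (input_dim + units) + units)


def dense_params(input_dim, units):
    # Weight matrix plus bias.
    return input_dim * units + units


vocab_size = 66   # number of distinct characters reported above
units = 128       # memory units per LSTM layer

first_lstm = lstm_params(vocab_size, units)   # input is a one-hot character
second_lstm = lstm_params(units, units)       # input is the first LSTM's output
output_layer = dense_params(units, vocab_size)
total = first_lstm + second_lstm + output_layer

print(first_lstm, second_lstm, output_layer, total)
```

Dropout layers add no parameters, so under these assumptions the model has roughly 240 thousand weights to fit from about 28 thousand training sequences.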


In [14]:
def sample(preds, temperature=1.0):
    # helper function to sample an index from a probability array
    preds = np.asarray(preds).astype("float64")
    preds = np.log(preds) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)
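
The `temperature` argument (called diversity later on) controls how sharply the predicted distribution is peaked before sampling. The sketch below applies the same rescaling as `sample` to a made-up three-class distribution; `rescale` and `toy_probs` are illustrative names, and the random draw is left out so the effect is deterministic.

```python
import numpy as np


def rescale(preds, temperature):
    # Same transformation as in sample(): divide log-probabilities by the
    # temperature, exponentiate, and renormalise so the result sums to 1.
    preds = np.asarray(preds, dtype="float64")
    scaled = np.log(preds) / temperature
    exp_scaled = np.exp(scaled)
    return exp_scaled / exp_scaled.sum()


toy_probs = [0.5, 0.3, 0.2]   # made-up softmax output for illustration
for temperature in [0.2, 0.5, 1.0, 1.2]:
    print(temperature, np.round(rescale(toy_probs, temperature), 3))
```

Temperatures below 1 push probability mass toward the most likely character, so output is more repetitive but safer. A temperature of 1 leaves the distribution unchanged, and values above 1 flatten it, which makes unlikely characters more common.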


## Model Fit
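We fit the LSTM model above for 50 epochs with a batch size of 64 sequences. After each epoch, we pick a random 100-character seed from the corpus and generate 400 characters at each of four diversity (temperature) settings: 0.2, 0.5, 1.0 and 1.2.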

In [17]:
epochs = 50
batch_size = 64

filepath="weights-improvement-{epoch:02d}-{loss:.4f}-bigger.hdf5"
checkpoint = ModelCheckpoint(filepath, monitor='loss', verbose=1, save_best_only=True, mode='min')
callbacks_list = [checkpoint]

for epoch in range(epochs):
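    # Pass the checkpoint callback so the best weights are actually saved; initial_epoch
    # keeps the epoch number in the checkpoint filename in step with this loop.
    model.fit(x, y, batch_size=batch_size, epochs=epoch + 1, initial_epoch=epoch,
              callbacks=callbacks_list)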
    print()
    print("Generating text after epoch: %d" % epoch)

    start_index = random.randint(0, len(raw_text) - maxlen - 1)
    for diversity in [0.2, 0.5, 1.0, 1.2]:
        print("...Diversity:", diversity)

        generated = ""
        sentence = raw_text[start_index : start_index + maxlen]
        print('...Generating with seed: "' + sentence + '"')

        for i in range(400):
            x_pred = np.zeros((1, maxlen, len(chars)))
            for t, char in enumerate(sentence):
                x_pred[0, t, char_indices[char]] = 1.0
            preds = model.predict(x_pred, verbose=0)[0]
            next_index = sample(preds, diversity)
            next_char = indices_char[next_index]
            sentence = sentence[1:] + next_char
            generated += next_char

        print("...Generated: ", generated)
        print()



Generating text after epoch: 0
...Diversity: 0.2
...Generating with seed: " that artificial intelligence is helpful in its current form and will continue to assist humans.
oth"
...Generated:  er an artificial intelligent wechnets the research the research conseal inferal to the the researchers inferational intelligent and the the research the researchent anderal inferal intelligent the researchent of the research anderal intelligent of a to and to and searchers ander anderal infeneral intelligent anterone anteral the researchers anderent inferal the researchers to the researchers ander

...Diversity: 0.5
...Generating with seed: " that artificial intelligence is helpful in its current form and will continue to assist humans.
oth"
...Generated:  es researchent andecening andets enenelal cenelenal a the wored intelligent of the research to the an hangesiceational to intelligent wechneration inferacters andesation of searcher of the provessic for actement of of a for thetoriginal on 20 t


...Generated:   the biago many to a progress to symber to poning and proaling mithing of the digned and coulded algorithms ware to search the persomulations. archived from the original on 22 october 2006. retrieved 30 october 2005.
theter, stastan, s.; thoh, sgenear, c. (2000). "retuis of intelligence to many objects and the sogiend newwere of the simputing and incorces ai to the soft problems and benet the as u

...Diversity: 1.0
...Generating with seed: "ael (19 august 2015). "stephen hawking, elon musk, and bill gates warn about artificial intelligence"
...Generated:   in delidatetosalopon, rurn, betware be systen or part using artificial on search usess and l publes problems replate utexenss patberss, behist or aderstentanterd eval sonpend with an, imisitying. cpasigionation the anvertisipy of finteligence to macoined and that to pisks singons to robots bebnibats human and solve cill to propening as solve peraned intelligent seppochocolmy di toner intelligenci

...Diversity: 1.2
..

## Results
At epoch 49 we achieved a loss of 1.0348. The seed text was:

>"ew observation is received, that observation is classified based on previous experience.a classifier"

At a diversity of 0.2 we generated:

>s ai development (suppplies actions. archived from the original on 26 july 2016. retrieved 15 november 2016.
buten, nack; dack (2 detember 2019). "artificial intelligence and intelligence and the signal of the sampes and be approach as development and approach as the relationshure and approach as development and many and intelligent than most to solve ai research and intelligent will neural networ

At a diversity of 0.5 we generated:

>  s and thumboo of the able to a developic and that improved to commonsens that are sia dedictions intelligent formal of the spossible late the forts ai provicic problem to may computer of goals, lowc the term would to predictions that the term "intelligence". bbn firming and could bobing that is the samples (such as now the uther many and spoces the ability that they to a partics in a machines. arc

At a diversity of 1.0 we generated:
>  atous is oxfinery the possibico as mili ingeard most thing robots to aab  taching maximize as solved that pastics legine of the trooks ===
>
>
>=== robot, spix. wled systems.
are excessed not diffecte.
heieen combor". ai maying behavioral and cambriagled in the would emambergy, neural networks ade lief lecalived and pasal cauged from path oures to spearical report chiccudes retriegenth. objects, arti

And finally, at a diversity of 1.2 we generated:
>  s deviced information purreding widely )ase". is normsed avi it is neural networks vary fiwl. archived from the originalion walizations dusin, lich results of wrivid eismen rights.
aroy plisof computing reasonings daper, ertyren 8 va via consmen freppbora to comored and naturopret of propesibing lawdla. .pronfornshies from a flowled interningiari'y hacd opencomput insticull known inscallepd is hou

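From these samples we see that a diversity of 0.2 produced the text that looks and reads the most realistic of the four. Most of its words are real English words, whereas the higher diversities produce progressively more misspelled and invented words.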

## Improvements

To improve this model, we might remove the punctuation from the source text if the goal is to generate a paragraph that reads realistically. We could also train for more epochs, reduce the batch size, or add more layers with different numbers of memory units.
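
As one possible way to try the first suggestion, the sketch below strips ASCII punctuation from a text before the vocabulary is built. The helper `strip_punctuation` is hypothetical and was not part of the experiment above. It only covers the characters in `string.punctuation`, so non-ASCII symbols in the Wikipedia text would remain.

```python
import string


def strip_punctuation(text):
    # Delete every ASCII punctuation character; letters, digits and
    # whitespace (including newlines) are left untouched.
    table = str.maketrans("", "", string.punctuation)
    return text.translate(table)


# A fragment of the corpus shown earlier, used here as a small example.
toy = 'leading ai textbooks define the field as the study of "intelligent agents": any system'
cleaned = strip_punctuation(toy)

print(cleaned)
print(len(set(toy)), "->", len(set(cleaned)), "distinct characters")
```

A smaller vocabulary shrinks both the one-hot input width and the softmax output, at the cost of losing sentence boundaries in the generated text.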
