# Char Prediction using LSTM

1. Download the text of Alice in Wonderland or Dracula from https://www.gutenberg.org/browse/scores/top in plain text format.
2. Create a char_to_int map which maps each character used in the novel to an integer, for example {a: 3}.
3. Read the data from the text file and do the following:
   3.1 Create a sliding window that takes the first 100 characters as the input sequence and the 101st character as the output. The window slides over the text one character at a time. For example:
         "Avul Pakir Jainulabdeen Abdul Kalam better known as A.P.J. Abdul Kalam"
         First take the 100 characters starting at "A" as the input; the 101st character is the output.
         Then take the 100 characters starting at "v" as the input; the character that follows them is the output.
   3.2 Convert the input and output sequences to their integer representation using the char_to_int map. This gives two arrays, seqIn and seqOut, whose elements are the integer representation of 100 characters and 1 character respectively: seqIn = [[10........15], [5.....25]...] seqOut = [5, 2, 5]
4. Reshape seqIn to (NumberOfSamples, 100, 1), which gives [[[10]........[15]], [[5]..... [25]]...]
5. One-hot encode seqOut using np_utils.to_categorical (also available as keras.utils.to_categorical).
6. Create a simple model with an LSTM followed by a Dense layer.
7. Given a seed sentence, predict the next character using the trained model.
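
Steps 2 to 4 can be sketched on the example sentence with plain Python. The window size of 5 (instead of 100) and the helper name `make_windows` are assumptions made only to keep the output readable; they are not part of the notebook.

```python
# Illustrative sketch: window size 5 instead of 100; make_windows is a hypothetical helper.
text = "Avul Pakir Jainulabdeen Abdul Kalam"
window = 5

# Step 2: map every distinct character to an integer (sorted for a stable order).
char_to_int = {ch: i for i, ch in enumerate(sorted(set(text)))}

def make_windows(data, size):
    """Return (inputs, targets): each input is `size` characters, each target the next one."""
    inputs = [data[i:i + size] for i in range(len(data) - size)]
    targets = [data[i + size] for i in range(len(data) - size)]
    return inputs, targets

# Step 3: slide one character at a time, then convert characters to integers.
windows, targets = make_windows(text, window)
seq_in = [[char_to_int[ch] for ch in w] for w in windows]
seq_out = [char_to_int[ch] for ch in targets]

# Step 4: wrap each integer in its own list to get the shape (samples, window, 1).
seq_in_3d = [[[value] for value in row] for row in seq_in]

print(repr(windows[0]), "->", repr(targets[0]))  # first window and the character after it
print(repr(windows[1]), "->", repr(targets[1]))  # the window has moved by one character
print(len(seq_in_3d), len(seq_in_3d[0]), len(seq_in_3d[0][0]))
```

The three numbers printed last correspond to (NumberOfSamples, window, 1), the layout that step 4 asks for.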

## Importing Packages

In [80]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
import keras
# Sequence utilities (e.g. padding)
from keras.preprocessing import sequence
# LSTM, Dense and Dropout layers
from keras.layers import LSTM, Dense, Dropout
from keras.layers import Embedding
# Sequential model: a linear stack of layers
from keras.models import Sequential

## Storing the Document

In [4]:
file = open('AliceinWonderland.txt').read()

In [7]:
# Replacing every '\n' with a space and dropping every '\r'
file = file.replace('\n',' ').replace('\r','')

## Calculating the number of unique characters in the document

In [11]:
# Stores the unique letters from the document
chars = list(set(file))

# Stores the number of unique letters which is the num_classes in outputs
unique_chars_count = len(chars)
print(unique_chars_count)

85


## Converting text characters to integers

In [55]:
# Neural networks accept only numeric inputs, so characters are converted to integers

# Maps characters to integers
char_to_int = {char: i for i, char in enumerate(chars)}

# Maps integers back to characters
int_to_char = {i: char for i, char in enumerate(chars)}
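
Because the iteration order of a `set` of strings is not guaranteed to be the same from one Python session to the next, a mapping built from `list(set(file))` can assign different integers each time the notebook is restarted. That matters when weights saved in one session are loaded in another. Below is a minimal sketch of a reproducible variant, assuming that sorting the characters is acceptable; the sample string and the `_stable` names are made up for illustration.

```python
sample_text = "alice was beginning to get very tired"  # hypothetical stand-in for the novel

# Sorting the unique characters fixes their order, so the mapping is identical on every run.
chars_sorted = sorted(set(sample_text))
char_to_int_stable = {ch: i for i, ch in enumerate(chars_sorted)}
int_to_char_stable = {i: ch for ch, i in char_to_int_stable.items()}

# Round trip: encoding and then decoding returns the original text.
encoded = [char_to_int_stable[ch] for ch in sample_text]
decoded = "".join(int_to_char_stable[i] for i in encoded)
print(decoded == sample_text)
```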

In [56]:
''' SLIDING FUNCTION: Slides over the input text file character by character'''

def slider(data, slide):
    x = []
    y = []
    for i in range(len(data)-slide):
        x.append([char for char in data[i:i+slide]])
        y.append(data[i+slide])
    return x,y

In [60]:
''' CHAR TO INT CONVERSION FUNCTION: Converts character dataset to int dataset '''

def char_data_to_int_data(x,y, char_to_int):
    input_int = []
    output_int = []
    
    for i in range(len(x)):
        input_int.append([char_to_int[char] for char in x[i]])
        output_int.append([char_to_int[char] for char in y[i]])
    return input_int,output_int

In [61]:
''' INITIALIZATION FUNCTION: Accepts the text, the window size (slide) and the char_to_int map; returns the reshaped integer inputs and the integer outputs '''

def main(data, slide, char_to_int):
    x, y = slider(data, slide)
    input_int, output_int = char_data_to_int_data(x, y, char_to_int)
    output_int = list(np.array(output_int).flatten())
    input_int = np.array(input_int).reshape(len(input_int),slide,1)
    return input_int,output_int

## Initializing

In [62]:
X,Y = main(file,100,char_to_int)

In [64]:
''' X=(163715, 100, 1) 

    Number of samples = 163715
    Number of inputs  = 100 (Letter1, Letter2...., Letter100)
               Output = 1 (Letter101)
'''


X.shape

(163715, 100, 1)

In [66]:
len(Y)

163715

In [68]:
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.01, random_state=1)
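
The sample counts reported in this notebook can be checked by hand. The sketch below assumes that `train_test_split` rounds a fractional `test_size` up to the next whole sample and gives the remainder to the training set, which is consistent with the shapes printed in the following cells.

```python
import math

n_samples = 163715   # len(Y) reported above
window = 100         # slide length passed to main()
test_size = 0.01

# Each sample needs `window` input characters plus one target character,
# so the cleaned text must contain n_samples + window characters.
text_length = n_samples + window

# Assumed rounding rule: the test share is rounded up, the rest is used for training.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(text_length, n_train, n_test)
```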

### One-Hot-Encoding Output Values

In [76]:
# Total no. of classes = Unique Values in the document, [0,0,0,.....1]
y_train_oneHotEncoded = keras.utils.to_categorical(y_train, num_classes=unique_chars_count)
y_test_oneHotEncoded = keras.utils.to_categorical(y_test, num_classes=unique_chars_count)
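
One-hot encoding turns each integer label into a vector with one entry per class and a single 1 at the label's index. A plain-Python sketch of the idea follows; `one_hot` is a hypothetical helper written for illustration, not the Keras implementation.

```python
def one_hot(labels, num_classes):
    """Return one row per label, with a 1 at the label's index and 0 elsewhere."""
    rows = []
    for label in labels:
        row = [0] * num_classes
        row[label] = 1
        rows.append(row)
    return rows

# Three labels over six classes, echoing the seqOut = [5, 2, 5] example.
encoded = one_hot([5, 2, 5], num_classes=6)
for row in encoded:
    print(row)
```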

In [77]:
x_train.shape

(162077, 100, 1)

In [78]:
y_train_oneHotEncoded.shape

(162077, 85)

In [79]:
unique_chars_count

85

## LSTM Model

In [101]:
model = Sequential()
model.add(LSTM(64,input_shape=(x_train.shape[1], x_train.shape[2]), return_sequences=True))
model.add(LSTM(128))
model.add(Dense(unique_chars_count, activation="sigmoid"))
## Compiling Model
model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])

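## Fitting the model from scratch, without loading saved weights
# Note: batch_size is not defined in the cells shown here; it must be set before this cell runs
model.fit(x_train, y_train_oneHotEncoded, batch_size=batch_size, epochs=1, validation_data=(x_test, y_test_oneHotEncoded))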

Train on 162077 samples, validate on 1638 samples
Epoch 1/1


<keras.callbacks.History at 0x192d46c6b38>
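
The output layer above uses a sigmoid activation together with `categorical_crossentropy`. For a single-label problem such as next-character prediction, softmax is the more common pairing, because its outputs form a probability distribution over the classes. The sketch below uses made-up scores to show the difference; it does not reproduce the notebook's model.

```python
import math

def sigmoid(scores):
    """Squash each score independently into the range (0, 1)."""
    return [1.0 / (1.0 + math.exp(-s)) for s in scores]

def softmax(scores):
    """Turn scores into probabilities that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # subtracting the max avoids overflow
    total = sum(exps)
    return [e / total for e in exps]

scores = [2.0, 1.0, 0.1]  # made-up raw outputs for three classes

print(sum(sigmoid(scores)))  # not constrained to 1: each class is scored on its own
print(sum(softmax(scores)))  # sums to 1, up to floating-point rounding
```

Both functions rank the classes in the same order, so the predicted character is unchanged; what differs is whether the outputs can be read as class probabilities.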

In [114]:
### Loading Weights
#model.load_weights('weights-improvement-49-1.2575.hdf5', by_name=False)

In [None]:
### Adding Dropout (disabled)
# Note: Dropout expects a rate between 0 and 1 as its first argument
#model.add(Dropout(32, input_shape=(x_train.shape[1], x_train.shape[2]))

In [102]:
predict = model.predict(x_test)

In [103]:
evaluate = model.evaluate(x_test, y_test_oneHotEncoded)



In [104]:
accuracy = evaluate[1]
accuracy*100

29.853479853479854
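
The accuracy returned by `model.evaluate` is the fraction of samples whose highest-scoring class matches the true class. A small sketch with made-up predictions follows; `argmax` and `accuracy` are helpers defined here for illustration.

```python
def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=lambda i: values[i])

def accuracy(predicted_rows, true_labels):
    """Fraction of rows whose argmax equals the true label."""
    hits = sum(argmax(row) == label for row, label in zip(predicted_rows, true_labels))
    return hits / len(true_labels)

# Made-up class scores for four samples over three classes.
predictions = [
    [0.1, 0.7, 0.2],
    [0.5, 0.3, 0.2],
    [0.2, 0.2, 0.6],
    [0.4, 0.5, 0.1],
]
labels = [1, 0, 1, 1]

print(accuracy(predictions, labels) * 100)
```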

In [105]:
test = "Project Gutenberg’s Alice’s Adventures in Wonderland, by Lewis Carroll This eBook is for the use of anyone anywhere at no cost and with almost no restrictions whatsoever.  You may copy it, give it away or re-use it under the terms of the Project Gutenberg License included with this eBook or online at www.gutenberg.org Title: Alice’s Adventures in Wonderland Author: Lew"

In [106]:
test_x, test_y = main(test,100,char_to_int)

In [107]:
test_x.shape

(271, 100, 1)

In [108]:
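# predict_classes was removed in later Keras releases; taking the argmax of the predicted scores gives the same class indices
pre = np.argmax(model.predict(test_x), axis=-1)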

In [109]:
output = []
actual = []
for i,j in zip(pre,test_y):
    output.append(int_to_char[i])
    actual.append(int_to_char[j])
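
The cells above predict one character for each 100-character window of the seed text. To generate longer text, a common approach is to append each predicted character to the text and predict again from the shifted window. The sketch below shows only that loop: `predict_next` is a hypothetical stand-in for a call to the trained model, and here it simply returns the most frequent follower of the window's last character in a tiny made-up corpus.

```python
from collections import Counter, defaultdict

corpus = "the cat sat on the mat and the cat ran"  # made-up training text

# Stand-in "model": for each character, count which characters follow it.
followers = defaultdict(Counter)
for current, nxt in zip(corpus, corpus[1:]):
    followers[current][nxt] += 1

def predict_next(window):
    """Hypothetical predictor: most frequent follower of the window's last character."""
    return followers[window[-1]].most_common(1)[0][0]

def generate(seed, length, window_size):
    """Repeatedly predict one character and slide the window forward by one."""
    text = seed
    for _ in range(length):
        window = text[-window_size:]
        text += predict_next(window)
    return text

generated = generate("the c", length=10, window_size=5)
print(generated)
```

With the notebook's model, `predict_next` would instead encode the window with `char_to_int`, reshape it to (1, 100, 1), and decode the highest-scoring class with `int_to_char`.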

In [110]:
for i, j in zip(output,actual):
    print("predicted : ",i," Actual : ",j)

predicted :  t  Actual :  a
predicted :  n  Actual :  n
predicted :     Actual :  y
predicted :     Actual :  o
predicted :     Actual :  n
predicted :     Actual :  e
predicted :     Actual :   
predicted :  t  Actual :  a
predicted :  n  Actual :  n
predicted :     Actual :  y
predicted :     Actual :  w
predicted :     Actual :  h
predicted :     Actual :  e
predicted :     Actual :  r
predicted :     Actual :  e
predicted :     Actual :   
predicted :  t  Actual :  a
predicted :  n  Actual :  t
predicted :     Actual :   
predicted :  t  Actual :  n
predicted :     Actual :  o
predicted :     Actual :   
predicted :  t  Actual :  c
predicted :  a  Actual :  o
predicted :  i  Actual :  s
predicted :     Actual :  t
predicted :     Actual :   
predicted :  t  Actual :  a
predicted :  n  Actual :  n
predicted :     Actual :  d
predicted :     Actual :   
predicted :  t  Actual :  w
predicted :  a  Actual :  i
predicted :  t  Actual :  t
predicted :     Actual :  h
predicted :     Actu