## Char Prediction using LSTM

#### Just the flow accuracy not proper

1. Download data of Alice in Wonderland or Dracula from https://www.gutenberg.org/browse/scores/top in plain text format
2. Create an char_to_int map which maps each character used in the novel to an integer. example {a: 3}
3. Read data from the text file and do the following:
    3.1 Create a sliding window in which it takes in first 100 characters as the input sequence and 101th character as the output sequence. (It slides over every character).
    For example: 
        "Avul Pakir Jainulabdeen Abdul Kalam better known as A.P.J. Abdul Kalam"
        You should slide from "A" to the 100th char and 101th char will be your output.
        Then you should start sliding from "v" to the 100th char and 101th char will be your output.
    The input and the output sequence should be converted to their integer representation using the char_to_int map.
    With this you basically have two arrays seqIn and seqOut with each element containing integer representation of 100 characters and 1 character respectively.
    seqIn = [[10........15], [5.....25]...] seqOut = [5, 2, 5]
4. Now reshape your seqIn as (NumberOfSamples, 100, 1) - So you basically get this [[[10]........[15]], [[5]..... [25]]...]
5. One hot encode your seqOut using np_utils.to_categorical

6. Now create a simple model with LSTM followed by a Dense layer.

7. Then, given a seed sentence predict the next character using the model created.


# Imports

In [1]:
import string
import numpy as np
import keras
import pandas as pd
import nltk
import h5py
from keras.layers import Dense, LSTM, Activation, Dropout
from keras.utils import to_categorical
from keras.models import Sequential
from keras.optimizers import rmsprop
from keras.activations import relu,softmax
import plotly.offline as plot
import plotly.graph_objs as go
plot.offline.init_notebook_mode(connected = True)

Using TensorFlow backend.


# Reading the file

In [2]:
get_file = open('data/Alice in Wonderland.txt','r')
data = get_file.read()

# String Manipulations(lower conversion and removing punctuations)

In [3]:
data = data.lower()
chars = sorted(list(set(data)))
filtered_chars = [char for char in chars if char not in set(string.punctuation)]

## Doing more filtering

In [4]:
more_exclude = [filtered_chars[42],filtered_chars[41],filtered_chars[40],filtered_chars[39],filtered_chars[38],filtered_chars[0]]
data = [chars for chars in data if chars not in more_exclude]
filtered_chars = [chars for chars in filtered_chars if chars not in more_exclude]

## Converting above filtered data into dictionary

In [5]:
char_to_int = {}
for index, char in enumerate(filtered_chars):
    char_to_int[char] = index

# Applying filters to data

In [6]:
data = [chars for chars in data if chars in filtered_chars]

# char_to_int mapping in data

In [7]:
int_data = []
for char in data:
    int_data.append(char_to_int[char])
loop_till = len(int_data) - 101

# Getting input and output data

In [8]:
x = []
input_data = []
output_data = []
for index in range(loop_till):
    sequence = int_data[index : index + 100]
    input_data.append(sequence)
    y = int_data[index + 101]
    output_data.append(y)

# Splitting data into x_train/y_train and x_test/y_test as (85-15)

In [9]:
splitting_length = int(len(input_data) * .85)
x_train = input_data[:splitting_length]
x_test = input_data[splitting_length:]
y_train = output_data[:splitting_length]
y_test = output_data[splitting_length:]

# Making x data and y_data into numpy array

In [10]:
x_train = np.array(x_train).reshape(len(x_train),100,1)
x_test = np.array(x_test).reshape(len(x_test),100,1)
y_train = np.array(y_train)
y_test = np.array(y_test)

# Variables/Hyperparamters

In [11]:
epochs = 1
batch_size = 128
num_classes = len(filtered_chars)

 # One Hot encoding for y data

In [12]:
y_train = to_categorical(y_train, num_classes=num_classes)
y_test = to_categorical(y_test,num_classes = num_classes)

# Model creation Sequence Size = 100 and epochs = 50

In [14]:
model = Sequential()
model.add(LSTM(256, input_shape=x_train[0].shape,return_sequences=True))
model.add(Dropout(0.2))
model.add(LSTM(256))
model.add(Dropout(0.2))
model.add(Dense(num_classes))
model.add(Activation(softmax))
model.load_weights('weights.hdf5', by_name=True) # using pre-trained weights
model.compile(optimizer='rmsprop',metrics=['accuracy'],loss='categorical_crossentropy')
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
lstm_3 (LSTM)                (None, 100, 256)          264192    
_________________________________________________________________
dropout_3 (Dropout)          (None, 100, 256)          0         
_________________________________________________________________
lstm_4 (LSTM)                (None, 256)               525312    
_________________________________________________________________
dropout_4 (Dropout)          (None, 256)               0         
_________________________________________________________________
dense_2 (Dense)              (None, 37)                9509      
_________________________________________________________________
activation_2 (Activation)    (None, 37)                0         
Total params: 799,013
Trainable params: 799,013
Non-trainable params: 0
_________________________________________________________________


# Model Evaluation

In [15]:
model.evaluate(x_test,y_test)



[3.6716466946448034, 0.021889756829515866]

# Model/Char Prediction for a sentence

In [16]:
int_to_char = {value:key for key, value in char_to_int.items()}

In [17]:
seed = x_train[0]
sent = ''.join([int_to_char[int(char)] for char in seed])
print(sent)
for i in range(20):
    pred_char = model.predict_classes(np.array(seed).reshape(1,100,1))
    print('Character ',int_to_char[int(pred_char)])
    seed = np.roll(seed,-1)
    seed[-1] = pred_char
    sent = ''.join([int_to_char[int(char)] for char in seed])
    print(sent)

project gutenbergs alices adventures in wonderland by lewis carrollthis ebook is for the use of anyo
Character  g
roject gutenbergs alices adventures in wonderland by lewis carrollthis ebook is for the use of anyog
Character  g
oject gutenbergs alices adventures in wonderland by lewis carrollthis ebook is for the use of anyogg
Character  g
ject gutenbergs alices adventures in wonderland by lewis carrollthis ebook is for the use of anyoggg
Character  g
ect gutenbergs alices adventures in wonderland by lewis carrollthis ebook is for the use of anyogggg
Character  g
ct gutenbergs alices adventures in wonderland by lewis carrollthis ebook is for the use of anyoggggg
Character  g
t gutenbergs alices adventures in wonderland by lewis carrollthis ebook is for the use of anyogggggg
Character  g
 gutenbergs alices adventures in wonderland by lewis carrollthis ebook is for the use of anyoggggggg
Character  g
gutenbergs alices adventures in wonderland by lewis carrollthis ebook is for the use of 

# Character wise plot(BiGrams for character)

In [18]:
following = nltk.ConditionalFreqDist(nltk.bigrams(data))
follow = pd.DataFrame.from_dict(following)
follow = follow.fillna(value=0,axis = 1)
mat = follow.as_matrix()
trace = go.Heatmap(z = mat, x = list(follow.index) ,y = follow.columns)
layout = go.Layout(yaxis=dict(autotick=False))
figure = go.Figure(data=[trace], layout=layout)
plot.iplot(figure)