In this notebook we use recurrent neural networks (RNNs), specifically Long Short-Term Memory (LSTM) cells, to generate text one character at a time.

# Importing all the Dependencies

In [1]:
import numpy as np
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Dropout
from keras.layers import LSTM
from keras.callbacks import ModelCheckpoint
from keras.utils import np_utils

Using TensorFlow backend.


Reading in the input text. We train on all the tweets at once. Normally people feed tweets to the model as separate entities, but when I tried that my results were very bad: mostly a single character was predicted every time. I am still figuring out whether that was due to an error on my part or something else. So for now, let's feed the tweets together, with one tweet following another the way one sentence follows another in a large text document.

In [2]:
filename = "liver_for_generation.txt"
raw_text = open(filename, encoding="utf8").read()
raw_text = raw_text.lower()

In [3]:
len(raw_text)

408124

### Creating mapping from characters to integers

In [4]:
chars = sorted(list(set(raw_text)))
char_to_int = dict((c, i) for i, c in enumerate(chars))

### Creating the reverse mapping from integer to character

In [13]:
int_to_char = dict((i, c) for i, c in enumerate(chars))
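
To see what the two mappings do, here is a minimal sketch on a toy string. The name `sample_text` and the other `sample_*` names are made up for illustration and are not part of the notebook; the construction mirrors the cells above.

```python
# Toy text standing in for raw_text (an assumption for illustration only).
sample_text = "hello lfc"

# Same construction as above: sorted unique characters, then two lookups.
sample_chars = sorted(set(sample_text))
sample_char_to_int = {c: i for i, c in enumerate(sample_chars)}
sample_int_to_char = {i: c for i, c in enumerate(sample_chars)}

# Encode the text to integers, then decode it back to characters.
encoded = [sample_char_to_int[c] for c in sample_text]
decoded = "".join(sample_int_to_char[i] for i in encoded)

print(sample_chars)
print(encoded)
print(decoded)
```

Decoding the encoded list gives back the original string, which is what lets us turn the model's integer predictions into readable text later on.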

The reason for the poor quality of the text generated when training on tweets is this huge vocabulary. We could try to reduce the vocabulary, but then the output would not necessarily be a tweet. It would be plain text that is hard to make sense of, because we might have removed emojis or exclamation marks that are important for understanding the message a person is trying to convey.

In [5]:
n_chars = len(raw_text)
n_vocab = len(chars)
print("total characters: ", n_chars)
print("total vocab: ", n_vocab)

total characters:  408124
total vocab:  263


## Creating the different patterns
These patterns are what we feed into the RNN. The network trains on them and tries to learn how they can be used to generate the next character by building a probabilistic model. For example, if a particular pattern occurs often and the character after it is always, say, 'a', then the model will be almost certain to output 'a' when it sees that pattern.
In most cases, though, no pattern leads to a single sure-shot character, so every character in the vocabulary gets a probability for each pattern, and the model uses these probabilities to output the most probable character.
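
The idea of "a pattern gives a probability to each next character" can be illustrated with plain counting. This is only a rough sketch of the intuition, not what the LSTM does internally; `toy_text`, `context_len` and `next_char_probs` are hypothetical names used just for this example.

```python
from collections import Counter, defaultdict

# Tiny made-up corpus; "aa" is usually followed by "b", once by "c".
toy_text = "aab aab aac aab"
context_len = 2

# Count which character follows each 2-character context.
counts = defaultdict(Counter)
for i in range(len(toy_text) - context_len):
    context = toy_text[i:i + context_len]
    next_char = toy_text[i + context_len]
    counts[context][next_char] += 1

def next_char_probs(context):
    """Turn the raw counts for one context into probabilities."""
    total = sum(counts[context].values())
    return {ch: n / total for ch, n in counts[context].items()}

probs = next_char_probs("aa")
print(probs)
print(max(probs, key=probs.get))  # the most probable next character
```

Here the context `"aa"` does not lead to one sure-shot character, so both candidates get a probability and the most probable one is picked.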

In [None]:
seq_length = 140
data_input_X = []
data_out_Y = []

In [6]:
for i in range(0, n_chars - seq_length, 1):
    seq_in = raw_text[i:i+seq_length]
    seq_out = raw_text[i+ seq_length]
    data_input_X.append([char_to_int[char] for char in seq_in])
    data_out_Y.append([char_to_int[seq_out]])
n_patterns = len(data_input_X)
print("total patterns: ", n_patterns)

total patterns:  407984
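
The loop above slides a window of `seq_length` characters over the text, one character at a time, which is why the number of patterns is 408124 - 140 = 407984. A minimal sketch of the same idea on a toy string (the `toy_*` names are made up for illustration):

```python
toy_text = "liverpool"
toy_seq_length = 4

toy_X, toy_Y = [], []
# Each window of 4 characters is an input; the character after it is the target.
for i in range(len(toy_text) - toy_seq_length):
    toy_X.append(toy_text[i:i + toy_seq_length])
    toy_Y.append(toy_text[i + toy_seq_length])

for seq_in, seq_out in zip(toy_X, toy_Y):
    print(repr(seq_in), "->", repr(seq_out))
```

A 9-character text with a window of 4 gives 9 - 4 = 5 patterns, the first one being `'live' -> 'r'`.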


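### Reshaping the input data into a shape a sequential model can take

Until now, data_input_X was a plain list of integer sequences. It is now reshaped into the (patterns, sequence length, features) dimensions that the LSTM layer expects, the values are scaled by the vocabulary size, and the target characters are one-hot encoded.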

In [7]:
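# Reshape to (samples, time steps, features), as the LSTM layer expects
X = np.reshape(data_input_X, (n_patterns, seq_length, 1))
# Scale the integer codes to the range [0, 1)
X = X / float(n_vocab)
# One-hot encode the target characters
y = np_utils.to_categorical(data_out_Y)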
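
To make the shapes concrete, here is a small sketch on made-up data. It uses `np.eye` as a NumPy stand-in for the one-hot encoding, so it runs without Keras; the `toy_*` names and values are assumptions for illustration only.

```python
import numpy as np

toy_vocab = 5
# Four patterns of length 3, already converted to integer codes.
toy_inputs = [[0, 1, 2], [1, 2, 3], [2, 3, 4], [3, 4, 0]]
toy_targets = [3, 4, 0, 1]

# (samples, time steps, features), then scaled by the vocabulary size.
toy_X = np.reshape(toy_inputs, (len(toy_inputs), 3, 1)) / float(toy_vocab)
# One row per target, with a 1 in the column of the target character.
toy_y = np.eye(toy_vocab)[toy_targets]

print(toy_X.shape)
print(toy_y.shape)
print(toy_y[0])
```

Each input becomes a column of scaled values with one feature per time step, and each target becomes a vector with one entry per character in the vocabulary.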

## Model Architecture
The first layer is an LSTM layer, followed by a dropout layer and a final dense layer that uses the softmax function to output a probability for each character in the vocabulary.

In [8]:
model = Sequential()
model.add(LSTM(256, input_shape = (X.shape[1], X.shape[2])))
model.add(Dropout(0.2))
model.add(Dense(y.shape[1], activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam')

## Saving the model weights whenever the loss improves on the best one so far

In [9]:
filepath="weights-improvement-{epoch:02d}.hdf5"
checkpoint = ModelCheckpoint(filepath, monitor='loss', verbose=1, save_best_only=True, mode='min')
callbacks_list = [checkpoint]
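
As far as I understand the Keras documentation, the placeholders in `filepath` are filled in with Python string formatting, using the epoch number and the logged metrics. The sketch below only shows how the names expand; it does not call Keras. The second pattern, `filepath_with_loss`, and the loss value are made up for illustration.

```python
# The pattern used above: one file name per epoch.
filepath = "weights-improvement-{epoch:02d}.hdf5"
names = [filepath.format(epoch=e) for e in (1, 2, 3)]
print(names)

# Hypothetical variant that also records the loss in the file name.
filepath_with_loss = "weights-improvement-{epoch:02d}-{loss:.4f}.hdf5"
example = filepath_with_loss.format(epoch=3, loss=2.5)  # made-up loss value
print(example)
```

Putting the loss in the name can make it easier to spot the best checkpoint among the saved files.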

# Training for only 3 epochs, as each one takes 2 hours!

In [11]:
model.fit(X, y, epochs=3, batch_size=128, callbacks=callbacks_list)

Epoch 1/3
Epoch 2/3
Epoch 3/3


<keras.callbacks.History at 0x1f60b44d8d0>

## Taking the best model

In [12]:
filename = "weights-improvement-03.hdf5"
model.load_weights(filename)
model.compile(loss='categorical_crossentropy', optimizer='adam')

In [16]:
import sys

# Generating tweets using a random seed as a start

In [None]:
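# Pick a random training pattern as the seed (copied so the training data is not modified)
start = np.random.randint(0, len(data_input_X) - 1)
pattern = list(data_input_X[start])
print("Seed:")
print("\"", ''.join([int_to_char[value] for value in pattern]), "\"")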

In [17]:
for i in range(100):
    # Shape the current pattern as (1, time steps, 1) and scale it like the training data
    x = np.reshape(pattern, (1, len(pattern), 1))
    x = x / float(n_vocab)
    pred = model.predict(x, verbose=0)
    # Greedy choice: take the most probable next character
    index = int(np.argmax(pred))
    res = int_to_char[index]
    sys.stdout.write(res)
    # Slide the window: drop the oldest character and append the prediction
    pattern = pattern[1:] + [index]
print("\nDone.")

Seed:
"  going to be cancelle… football365: mails: time for flying liverpool to drop emre can https://t.co/dsoa47qs6cbig wins in midweek for city an "
 coee ao laa  aac cite po lee  aac cite po lee  aac lite po  lice lo tee he lee  aac  ht caa  hic hi
Done.
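
The loop above always takes the `argmax`, i.e. the single most probable character. A commonly used alternative, which this notebook does not use, is to sample from the predicted distribution with a temperature. The sketch below is only an illustration of that technique on a made-up probability vector; `sample_with_temperature` and `toy_pred` are hypothetical names, not part of the notebook or of Keras.

```python
import numpy as np

def sample_with_temperature(probs, temperature=1.0, rng=None):
    """Sample an index from a probability vector after temperature scaling."""
    rng = np.random.default_rng() if rng is None else rng
    probs = np.asarray(probs, dtype=np.float64)
    # Work in log space; the small epsilon avoids log(0).
    logits = np.log(probs + 1e-12) / temperature
    # Softmax, shifted by the max for numerical stability.
    scaled = np.exp(logits - np.max(logits))
    scaled /= scaled.sum()
    return int(rng.choice(len(scaled), p=scaled))

# Made-up prediction over a 3-character vocabulary.
toy_pred = [0.6, 0.3, 0.1]
rng = np.random.default_rng(0)
draws = [sample_with_temperature(toy_pred, temperature=1.0, rng=rng)
         for _ in range(1000)]
counts = np.bincount(draws, minlength=3)
print(counts)
```

In the generation loop this could replace the `argmax` line with something like `index = sample_with_temperature(pred[0], temperature=0.8)`, where the value 0.8 is just an example. A very low temperature behaves almost like `argmax`, while higher values give more varied output.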


In [18]:
# Pick a random training pattern as the seed (copied so the training data is not modified)
start = np.random.randint(0, len(data_input_X) - 1)
pattern = list(data_input_X[start])
print("Seed:")
print("\"", ''.join([int_to_char[value] for value in pattern]), "\"")
for i in range(100):
    # Shape the current pattern as (1, time steps, 1) and scale it like the training data
    x = np.reshape(pattern, (1, len(pattern), 1))
    x = x / float(n_vocab)
    pred = model.predict(x, verbose=0)
    # Greedy choice: take the most probable next character
    index = int(np.argmax(pred))
    res = int_to_char[index]
    sys.stdout.write(res)
    # Slide the window: drop the oldest character and append the prediction
    pattern = pattern[1:] + [index]
print("\nDone.")

Seed:
" or liverpool instead of arsenal… https://t.co/ymp7bfuxpsrt @lfc: three points to take back home! 🔴 https://t.co/iuxjrty3dagreat choice @bbcr "
 la  hiverpool hat rit  afc cite po lee  aac cite po lee  aac cite po lee  aac cite po lee  aac cite po lee  aac cite po lee  aac cite po le
Done.


In [19]:
# Pick a random training pattern as the seed (copied so the training data is not modified)
start = np.random.randint(0, len(data_input_X) - 1)
pattern = list(data_input_X[start])
print("Seed:")
print("\"", ''.join([int_to_char[value] for value in pattern]), "\"")
for i in range(100):
    # Shape the current pattern as (1, time steps, 1) and scale it like the training data
    x = np.reshape(pattern, (1, len(pattern), 1))
    x = x / float(n_vocab)
    pred = model.predict(x, verbose=0)
    # Greedy choice: take the most probable next character
    index = int(np.argmax(pred))
    res = int_to_char[index]
    sys.stdout.write(res)
    # Slide the window: drop the oldest character and append the prediction
    pattern = pattern[1:] + [index]
print("\nDone.")

Seed:
" irgin trains1f14 12:05 euston to liverpool lime streetrt @josh_pne: these are women who just want to be able to play football. we should be  "
liee  hac hite pot  afte to lee  aat cit spe lene  l coals 
aa lone aa goae no lee  aac cite pot re 
Done.


In [20]:
# Pick a random training pattern as the seed (copied so the training data is not modified)
start = np.random.randint(0, len(data_input_X) - 1)
pattern = list(data_input_X[start])
print("Seed:")
print("\"", ''.join([int_to_char[value] for value in pattern]), "\"")
for i in range(100):
    # Shape the current pattern as (1, time steps, 1) and scale it like the training data
    x = np.reshape(pattern, (1, len(pattern), 1))
    x = x / float(n_vocab)
    pred = model.predict(x, verbose=0)
    # Greedy choice: take the most probable next character
    index = int(np.argmax(pred))
    res = int_to_char[index]
    sys.stdout.write(res)
    # Slide the window: drop the oldest character and append the prediction
    pattern = pattern[1:] + [index]
print("\nDone.")

Seed:
"  kick everton, they will. those same people who whitewash lfc time after time. w… https://t.co/bcwjdwbf2xrt @lfc: 8⃣ days
🔟 goals scored
1⃣  "
goals 
1

🎯 gaals 
-

aaseaad mttps://t.co/elnhh0dkdlrt @lfc: 8⃣ gaals 
1

aare aa moal  lag cit  -t
Done.


In [21]:
# Pick a random training pattern as the seed (copied so the training data is not modified)
start = np.random.randint(0, len(data_input_X) - 1)
pattern = list(data_input_X[start])
print("Seed:")
print("\"", ''.join([int_to_char[value] for value in pattern]), "\"")
for i in range(100):
    # Shape the current pattern as (1, time steps, 1) and scale it like the training data
    x = np.reshape(pattern, (1, len(pattern), 1))
    x = x / float(n_vocab)
    pred = model.predict(x, verbose=0)
    # Greedy choice: take the most probable next character
    index = int(np.argmax(pred))
    res = int_to_char[index]
    sys.stdout.write(res)
    # Slide the window: drop the oldest character and append the prediction
    pattern = pattern[1:] + [index]
print("\nDone.")

Seed:
" : watching the boys @lfc here in indonesia now before our masters game tomorrow against arsenal. give us the 3 points pl…@jackbmontgomery sc "
te se tee  aac lite po  hice lo tee he loe sene  ha  hiverpool 4aa  oiverpool 4-0 aisenpl laa cite 9
Done.
