# Character-level LSTM text generation

This notebook builds a statistical language model of the comments left on Reddit in The_Donald, the Donald Trump subreddit. An LSTM neural network is trained to predict the next character from the preceding ones, and sampling from the trained model generates text whose distribution is meant to resemble that of the text posted by Reddit users there. In effect, this is an attempt to simulate the average poster in The_Donald.

Some of this code is inspired by François Chollet's "Deep Learning with Python", a book that I would certainly recommend.

The data is the same as that in the "The_Donald subreddit" notebook, where I have linked its source.

In [1]:
import keras
import pandas as pd
import numpy as np
import string
import random 

from keras import layers

Using TensorFlow backend.


In [2]:
data = pd.read_csv(r"D:\Data_sets\Reddit\The_Donald_nov17.csv")

In [3]:
filt1 = (data['selftext']!='[removed]')
filt2 = (data['selftext']!='[deleted]')
filt3 = (data['selftext'].notnull())

data = data[filt1&filt2&filt3]

In [4]:
corpus = []
accepted = string.ascii_lowercase + " " + string.digits + ",!\"\'#%():."

In [5]:
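for row in data.itertuples():

    holder = []

    # Commas are temporarily spelled out as " comma"; cell 8 restores them.
    for ch in row.selftext.replace(",", " comma"):
        # The membership test is case-insensitive, but the original casing is kept.
        if ch.lower() in accepted:
            holder.append(ch)

    corpus.append("".join(holder))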

In [6]:
len(corpus)

7242

In [7]:
corpus = " ".join(corpus)

In [8]:
corpus = corpus.replace(" comma", ",")

In [9]:
corpus[:500]

"Think about it. Still the same race, but we'd be 'brown'. Their heads would explode. They constantly say we hate 'brown' people. How would that work https:www.youtube.comwatchvUhoU7lINYxkhe is so cool, really. Order 10 large pizzas and take it to your local police department. You helping to support Papa John's for leaving that stupid NFL crap and you're also helping to support the men and women in blue. Let's make this happen My thoughts, condolences and prayers to the victims and families of th"

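The sample above shows a URL collapsed into `https:www.youtube.comwatchv...` and apparently run straight into the next sentence, because characters such as `/`, `?`, `=` and newlines are not in `accepted`. The following is a minimal sketch of one possible remedy, not part of the original notebook: strip URLs and collapse whitespace before filtering. The names `URL_PATTERN`, `ACCEPTED` and `clean_post` are hypothetical, and the regular expression is deliberately simple rather than a complete URL matcher.

```python
import re
import string

# Hypothetical, deliberately simple pattern: anything from http(s):// up to the next whitespace.
URL_PATTERN = re.compile(r"https?://\S+")

# Same character set as `accepted` in the notebook.
ACCEPTED = set(string.ascii_lowercase + " " + string.digits + ",!\"'#%():.")


def clean_post(text):
    """Drop URLs, turn newlines and repeated spaces into one space, then filter characters."""
    text = URL_PATTERN.sub(" ", text)
    text = re.sub(r"\s+", " ", text)
    return "".join(ch for ch in text if ch.lower() in ACCEPTED).strip()


raw = "How would that work?\nhttps://www.example.com/watch?v=abc123\nHe is so cool, really."
cleaned = clean_post(raw)
print(cleaned)
```

Whether removing URLs is desirable depends on the goal: a model meant to imitate posts faithfully might instead replace each URL with a placeholder token.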
In [10]:
maxlen = 60
step = 3
sentences = []
next_chars = []

for i in range(0, len(corpus) - maxlen, step):
    sentences.append(corpus[i: i + maxlen])
    next_chars.append(corpus[i + maxlen])

In [11]:
len(sentences)

1162142
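
To make the windowing concrete, here is a small sketch that applies the same slicing to a toy string. The helper name `make_windows` and the toy values are illustrative assumptions; the loop body mirrors the cell above.

```python
def make_windows(text, maxlen, step):
    """Return (windows, next_chars) using the same slicing as the notebook cell."""
    windows, next_chars = [], []
    for i in range(0, len(text) - maxlen, step):
        windows.append(text[i: i + maxlen])   # input: maxlen consecutive characters
        next_chars.append(text[i + maxlen])   # target: the character that follows
    return windows, next_chars


toy_windows, toy_next = make_windows("hello world", maxlen=4, step=3)
for window, target in zip(toy_windows, toy_next):
    print(repr(window), "->", repr(target))
```

With `step = 3` consecutive windows overlap heavily, so the number of training samples is roughly a third of the corpus length.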

In [12]:
chars = sorted(set(corpus))

In [13]:
char_indices = {char: index for index, char in enumerate(chars)}

In [14]:
len(chars)

73

In [15]:
# One-hot tensors: x is (samples, maxlen, vocabulary), y is (samples, vocabulary).
x = np.zeros((len(sentences), maxlen, len(chars)), dtype=bool)
y = np.zeros((len(sentences), len(chars)), dtype=bool)

In [16]:
for i, sentence in enumerate(sentences):
    for t, char in enumerate(sentence):
        x[i, t, char_indices[char]] = 1
    y[i, char_indices[next_chars[i]]] = 1
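
The encoding above can be illustrated on a toy string with plain lists. This is a sketch only; `toy_text`, `toy_index` and `one_hot` are hypothetical names. With the figures reported in this notebook (1,162,142 windows, `maxlen` of 60, 73 distinct characters), `x` holds about 5.1 billion one-byte booleans, so roughly 5 GB of memory.

```python
toy_text = "abca"
toy_chars = sorted(set(toy_text))                     # vocabulary: ['a', 'b', 'c']
toy_index = {c: i for i, c in enumerate(toy_chars)}   # character -> column


def one_hot(seq, index):
    """Encode each character as a row with a single 1 in its vocabulary column."""
    rows = []
    for ch in seq:
        row = [0] * len(index)
        row[index[ch]] = 1
        rows.append(row)
    return rows


encoded = one_hot(toy_text, toy_index)
# Decoding reverses the mapping: the position of the 1 selects the character.
decoded = "".join(toy_chars[row.index(1)] for row in encoded)
print(encoded)
print(decoded)
```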

In [17]:
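def sample(preds, temperature=0.5):
    # Rescale the predicted distribution: temperatures below 1 sharpen it
    # (safer, more repetitive text), temperatures above 1 flatten it.
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    # Draw one character index from the rescaled distribution.
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)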
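
The effect of `temperature` is easier to see without the random draw. The sketch below repeats only the rescaling step of `sample` in plain Python; `reweight` and `toy_probs` are hypothetical names introduced for illustration.

```python
import math


def reweight(probs, temperature):
    """Rescale a distribution the way `sample` does before drawing from it."""
    scaled = [math.log(p) / temperature for p in probs]
    exps = [math.exp(s) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


toy_probs = [0.5, 0.3, 0.2]
for temperature in (0.2, 0.5, 1.0, 2.0):
    print(temperature, [round(p, 3) for p in reweight(toy_probs, temperature)])
```

A temperature of 1.0 leaves the distribution unchanged. Lower values push probability mass toward the most likely character, and higher values spread it more evenly, which trades coherence for variety in the generated text.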

In [18]:
model = keras.models.Sequential()
model.add(layers.LSTM(128, input_shape=(maxlen, len(chars))))
model.add(layers.Dense(len(chars), activation='softmax'))

In [19]:
model.compile(loss='categorical_crossentropy', optimizer=keras.optimizers.RMSprop(lr=0.01))

In [20]:
for epoch in range(1, 31):
    model.fit(x, y, batch_size=128, epochs=1)
    # Generate a sample every fifth epoch to monitor progress.
    if epoch % 5 == 0:

        # Seed with a random slice of the corpus, maxlen characters long.
        start_index = random.randint(0, len(corpus) - maxlen - 1)
        generated_text = corpus[start_index: start_index + maxlen]

        holder = []

        for _ in range(300):

            # One-hot encode the current window.
            sampled = np.zeros((1, maxlen, len(chars)))
            for t, char in enumerate(generated_text):
                sampled[0, t, char_indices[char]] = 1

            preds = model.predict(sampled, verbose=0)[0]
            next_index = sample(preds)
            next_char = chars[next_index]

            holder.append(next_char)

            # Slide the window forward by one character.
            generated_text += next_char
            generated_text = generated_text[1:]

        print("\nGenerating: \n\n", "".join(holder), "\n")

Epoch 1/1
Epoch 1/1
Epoch 1/1
Epoch 1/1
Epoch 1/1

Generating: 

 et a personal countrial something about the I don't despite the ready of the hold paying any and defectional and more did and the take the did the ready of this state to think and say that this is consenge state full to shoote that the way any of the state love for the control in the leaders and rec 

Epoch 1/1
Epoch 1/1
Epoch 1/1
Epoch 1/1
Epoch 1/1

Generating: 

  people in the political second living the mor sult picture that we don't say it can be might, they should say on the other concerned the factle and his population for the man in and call he are in the decided to have and can be a stuff to the current report with the government we shall the way supp 

Epoch 1/1
Epoch 1/1
Epoch 1/1
 249856/1162142 [=====>........................] - ETA: 9:03 - loss: 2.1035

KeyboardInterrupt: 