In [1]:
import numpy as np
import pandas as pd
import nltk
import sklearn as sk
import re # regex
import matplotlib

## Train/Test Split

In [2]:
news_df = pd.read_csv('../datasets/news-20k-comments.csv')
politics_df = pd.read_csv("../datasets/politics-20k-comments.csv")
politicaldiscussion_df = pd.read_csv('../datasets/politicaldiscussion-20k-comments.csv')

raw_df = pd.concat([news_df, politics_df, politicaldiscussion_df]) # merge all three subreddit datasets
raw_df.head()

Unnamed: 0,created_utc,ups,subreddit_id,link_id,name,score_hidden,author_flair_css_class,author_flair_text,subreddit,id,...,downs,archived,author,score,retrieved_on,body,distinguished,edited,controversiality,parent_id
0,1430438402,-11.0,t5_2qh3l,t3_34f1lq,t1_cqug92b,0.0,,,news,cqug92b,...,0.0,0.0,hogsucker,-11.0,1432703000.0,1-She got to be a bigwig at Google by sleeping...,,0.0,0.0,t1_cqu4t11
1,1430438407,1.0,t5_2qh3l,t3_34exjb,t1_cqug96h,0.0,,,news,cqug96h,...,0.0,0.0,flal4,1.0,1432703000.0,For those about to lynch this guy [here](http:...,,0.0,1.0,t1_cqudz0p
2,1430438439,4.0,t5_2qh3l,t3_34f10p,t1_cqug9tk,0.0,,,news,cqug9tk,...,0.0,0.0,HitachinoBia,4.0,1432703000.0,It feels like black people are the most racist...,,0.0,1.0,t1_cqufsip
3,1430438448,0.0,t5_2qh3l,t3_34cvvg,t1_cquga1l,0.0,,,news,cquga1l,...,0.0,0.0,[deleted],0.0,1432703000.0,[deleted],,0.0,0.0,t3_34cvvg
4,1430438449,-10.0,t5_2qh3l,t3_34e7eo,t1_cquga1v,0.0,,,news,cquga1v,...,0.0,0.0,Cultiststeve,-10.0,1432703000.0,Its because otherwise thats all that would app...,,0.0,0.0,t1_cqudxkr


In [3]:
all_comments = raw_df.filter(['created_utc', 'subreddit', 'body'])

In [4]:
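# filter out deleted comments; .copy() gives an independent DataFrame so later
# column assignments do not trigger SettingWithCopyWarning
comments = all_comments[all_comments['body'] != "[deleted]"].copy()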

In [5]:
# patterns for URLs and for quotes, parentheses, and square brackets
url_regex = r"([--:\w?@%&+~#=]*\.[a-z]{2,4}\/{0,2})((?:[?&](?:\w+)=(?:\w+))+|[--:\w?@%&+~#=]+)?"
special_character_regex = r"[\"'()[\]]"

comments['body'] = comments['body'].astype('str')

#remove urls, special characters, and replace hyphens with a space
comments['clean'] = comments['body'].apply(lambda text: text.strip().lower()).apply(lambda text: re.sub(url_regex, '', text)).apply(lambda text: re.sub(special_character_regex, '', text)).apply(lambda text: re.sub(r"-", ' ', text))
comments['clean'].head()


0    1 she got to be a bigwig at google by sleeping...
1    for those about to lynch this guy here is a sh...
2    it feels like black people are the most racist...
4    its because otherwise thats all that would app...
5    please go to facebook and comment and post on ...
Name: clean, dtype: object

In [6]:
comments['tokens'] = comments['clean'].apply(lambda text: re.sub(r"[.,!?]"," ", text)).apply(lambda text: re.sub(r"[0-9]", " ", text)).apply(nltk.wordpunct_tokenize)


In [7]:
comments = comments.reset_index(drop=True)
print(len(comments), "total comments")
comments.head()

55567 total comments


Unnamed: 0,created_utc,subreddit,body,clean,tokens
0,1430438402,news,1-She got to be a bigwig at Google by sleeping...,1 she got to be a bigwig at google by sleeping...,"[she, got, to, be, a, bigwig, at, google, by, ..."
1,1430438407,news,For those about to lynch this guy [here](http:...,for those about to lynch this guy here is a sh...,"[for, those, about, to, lynch, this, guy, here..."
2,1430438439,news,It feels like black people are the most racist...,it feels like black people are the most racist...,"[it, feels, like, black, people, are, the, mos..."
3,1430438449,news,Its because otherwise thats all that would app...,its because otherwise thats all that would app...,"[its, because, otherwise, thats, all, that, wo..."
4,1430438456,news,Please go to Facebook and comment and post on ...,please go to facebook and comment and post on ...,"[please, go, to, facebook, and, comment, and, ..."


## Building the RNN

In [124]:
len(comments['clean'])

55567

In [125]:
import tensorflow as tf
tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.ERROR)


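physical_devices = tf.config.list_physical_devices('GPU')
# enable memory growth only when a GPU is present; indexing [0] fails on CPU-only machines
if physical_devices:
    tf.config.experimental.set_memory_growth(physical_devices[0], enable=True)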

In [126]:
# concatenate the first 5000 comments, wrapping each one in < and > delimiters
all_text = "".join("<" + comment + ">" for comment in comments['clean'][:5000])

In [127]:
tokenizer = tf.keras.preprocessing.text.Tokenizer(char_level=True)
tokenizer.fit_on_texts(all_text)

In [128]:
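max_id = len(tokenizer.word_index)  # number of distinct characters
# fit_on_texts received a single string, so each character was counted as one "document";
# document_count is therefore the total number of characters
dataset_size = tokenizer.document_count
print("max id:", max_id, "\ndataset_size:", dataset_size)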

max id: 81 
dataset_size: 1087981


In [129]:
[encoded] = np.array(tokenizer.texts_to_sequences([all_text])) - 1

In [130]:
train_size = dataset_size * 90 // 100
dataset = tf.data.Dataset.from_tensor_slices(encoded[:train_size])

In [131]:
n_steps = 100
window_length = n_steps + 1
dataset = dataset.window(window_length, shift=1, drop_remainder=True)

In [132]:
dataset = dataset.flat_map(lambda window: window.batch(window_length))

In [133]:
batch_size = 32
dataset = dataset.shuffle(10000).batch(batch_size)
dataset = dataset.map(lambda windows: (windows[:, :-1], windows[:, 1:]))

In [134]:
dataset = dataset.map(lambda X_batch, Y_batch: (tf.one_hot(X_batch, depth=max_id), Y_batch))

In [135]:
dataset = dataset.prefetch(1)
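
The windowing cells above can be hard to picture, so here is a minimal pure-Python sketch of the same idea: slide a window of `n_steps + 1` ids over the encoded text one position at a time, then split each window into an input sequence and a target sequence shifted by one character. The helper `make_windows` and the toy ids are hypothetical and only illustrate the shapes; the real pipeline also shuffles, batches, and one-hot encodes.

```python
def make_windows(ids, n_steps):
    """Return (inputs, targets) pairs from overlapping windows of n_steps + 1 ids."""
    window_length = n_steps + 1
    pairs = []
    # One window per start position; incomplete windows at the end are dropped,
    # mirroring window(window_length, shift=1, drop_remainder=True).
    for start in range(len(ids) - window_length + 1):
        window = ids[start:start + window_length]
        # Inputs are all ids but the last; targets are the same ids shifted by one.
        pairs.append((window[:-1], window[1:]))
    return pairs


toy_ids = [0, 1, 2, 3, 4, 5]  # hypothetical encoded text
pairs = make_windows(toy_ids, n_steps=3)
for inputs, targets in pairs:
    print(inputs, "->", targets)
```

Each target sequence is the input sequence moved forward by one character, which is why the model can be trained to predict the next character at every time step.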

In [136]:
model = tf.keras.models.Sequential([
    tf.keras.layers.GRU(128, return_sequences=True, input_shape=[None, max_id], dropout=0.2),
    tf.keras.layers.GRU(128, return_sequences=True, dropout=0.2),
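    # softmax so the outputs are probabilities, which the loss below (from_logits=False by default) expects
    tf.keras.layers.TimeDistributed(tf.keras.layers.Dense(max_id, activation="softmax"))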
])
model.compile(loss="sparse_categorical_crossentropy", optimizer="adam")
model.summary()

Model: "sequential_3"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
gru_4 (GRU)                  (None, None, 128)         81024     
_________________________________________________________________
gru_5 (GRU)                  (None, None, 128)         99072     
_________________________________________________________________
time_distributed_3 (TimeDist (None, None, 81)          10449     
Total params: 190,545
Trainable params: 190,545
Non-trainable params: 0
_________________________________________________________________


In [138]:
history = model.fit(dataset, epochs=10)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [139]:
def preprocess(texts):
    X = np.array(tokenizer.texts_to_sequences(texts)) - 1
    return tf.one_hot(X, max_id)

In [140]:
X_new = preprocess(["peopl"])
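# predict_classes is no longer available in recent Keras releases; argmax over the last axis is equivalent
Y_pred = np.argmax(model.predict(X_new), axis=-1)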
tokenizer.sequences_to_texts(Y_pred+1)[0][-1]

'e'

In [146]:
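def generate_char(text, temperature=1):
    X_new = preprocess([text])
    # model outputs for the character that follows the last input character
    Y_pred = model.predict(X_new)[0, -1:, :]
    # scale the log of those outputs by the temperature; values below 1 make sampling more conservative
    rescaled = tf.math.log(Y_pred) / temperature
    # sample one id, then add 1 to undo the offset applied in preprocess
    char_id = tf.random.categorical(rescaled, num_samples=1) + 1
    return tokenizer.sequences_to_texts(char_id.numpy())[0]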

In [147]:
def generate(text, n=50, temperature=1):
    for _ in range(n):
        text += generate_char(text, temperature)
    return text
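
To make the effect of `temperature` concrete, here is a small standard-library sketch, assuming the model outputs a probability distribution over next characters. Dividing log-probabilities by a temperature `T` and renormalizing is the same as raising each probability to the power `1 / T`. The function names and the toy distribution are hypothetical and are not part of the notebook's pipeline.

```python
import random


def apply_temperature(probs, temperature=1.0):
    """Rescale a probability distribution: p_i ** (1 / T), then renormalize."""
    # Equivalent to softmax(log(p) / T); zero-probability entries stay at zero.
    scaled = [p ** (1.0 / temperature) if p > 0 else 0.0 for p in probs]
    total = sum(scaled)
    return [s / total for s in scaled]


def sample_index(probs, temperature=1.0, rng=random):
    """Draw one index from the temperature-adjusted distribution."""
    adjusted = apply_temperature(probs, temperature)
    return rng.choices(range(len(adjusted)), weights=adjusted, k=1)[0]


toy_probs = [0.5, 0.3, 0.2]  # hypothetical next-character probabilities
sharp = apply_temperature(toy_probs, temperature=0.2)  # concentrates on the top choice
flat = apply_temperature(toy_probs, temperature=5.0)   # moves toward uniform
print(sharp, flat)
```

A low temperature such as 0.2 makes generation nearly greedy, which tends to produce repetitive but safe output; higher temperatures add variety at the cost of more errors.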

In [155]:
print(generate("i", n=100, temperature=0.2))

it i in int at a tant an ci  ane the pei  e an o cere the tar ant c an hat the a perti t in t at the 


# Evaluation

In [254]:
boundary = int(len(comments)*0.8) #80/20 train/test split
test = comments[boundary:]

# Response to 1. Baseline Model Performance Metrics (Part 1)
We've chosen two metrics to evaluate our model's performance. The first, and simplest, is precision. In the context of generated sentences, we count a generated token as a hit if it appears in the test vocabulary and as a miss if it does not; precision is the fraction of generated tokens that are hits. This is a good heuristic for a baseline model: as the model improves we expect precision to rise and then plateau, since not every word in the test set occurs in the training set and vice versa.

A precision metric is useful for our application because we want to generate comments that resemble real Reddit comments, and therefore want to use the words that real Reddit comments use.
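
As a minimal sketch of this definition, the function below computes the fraction of whitespace-separated generated tokens found in a reference vocabulary. The name `vocabulary_precision`, the toy vocabulary, and the toy sentences are hypothetical; a plain `set` stands in for the vocabulary object used in the notebook.

```python
def vocabulary_precision(generated_sentences, reference_vocab):
    """Fraction of generated tokens that appear in reference_vocab."""
    tokens = [tok for sent in generated_sentences for tok in sent.split()]
    if not tokens:
        return 0.0  # nothing generated, so nothing to score
    hits = sum(1 for tok in tokens if tok in reference_vocab)
    return hits / len(tokens)


toy_vocab = {"the", "an", "in", "at"}            # hypothetical test vocabulary
toy_sentences = ["the wat an", "in tn at the"]   # hypothetical generated text
score = vocabulary_precision(toy_sentences, toy_vocab)
print(score)  # 5 of the 7 tokens are in the vocabulary
```

Short tokens such as "a" or "the" are easy hits, so a high score under this metric does not by itself mean the text is fluent.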

In [206]:
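# import the class directly rather than relying on `import nltk` to load the nltk.lm submodule
from nltk.lm import Vocabulary
test_vocab = Vocabulary([word for sent in test['tokens'] for word in sent])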

In [207]:
import random
evaluation_sentences = []
sample_starts = ["a", "the", "well", "i", "possibly", "this"]
number_to_generate = 10
for i in range(number_to_generate):
    start_token = random.choice(sample_starts)
    evaluation_sentences.append(generate(start_token, n=100, temperature=0.2))

In [208]:
evaluation_sentences

['possibly t e tent an o be hon the rene and i c   an w a   a c   on th sace to the dent he the en the the c  ',
 'well an out in ere the ne the the the ten the no tus t en the ant the nest b  an the c ane ance c  to c ',
 'the wat the in on th t per at the to the the about th t the the the a c  an the the des the a a c  tect',
 'possibly an the c nie b it t at the ton th t it b to m and an a te tant o the lont are an it o  the the a th',
 'this the b me an the the tan the tine t the tn to the to c  the tan c  b the c as a c   the a c   ant in',
 'well the ment the han acou  e the sace a w nt t it in c   on th thice   on th t at the her to the pent o',
 'well a c all ent a to the c oned an the dent a l to in an the in an the tere the pent aire the no c  an ',
 'at ance she ant a ti  t tan a c   in t i  er an in t the a to tune th t at o con ane the the ten c  b',
 'this an the c on the c pininal a c an to the c in w be the to a to a to a tall a c   on th tuite c  the ',
 'possibly an a

In [209]:
# calculating precision
generated_tokens = [token for sent in evaluation_sentences for token in sent.split()]

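# offset the count by the seed words: each of the number_to_generate sentences starts
# with a seed we supplied, and all of the seeds are in test_vocab
true_positive = -number_to_generate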
false_positive = 0
for word in generated_tokens:
    if word in test_vocab:
        true_positive += 1
    else:
        false_positive += 1

In [210]:
precision = true_positive / (true_positive + false_positive)
print("Our precision is", precision)

Our precision is 0.8825503355704698


### Discussion on Precision Metric

We evaluate precision not on the entire dataset but only on the last 20% of it (the test partition). We've sorted our data chronologically, so the newest comments make up the test set. This makes virtually no difference to our evaluation, since the data spans only about two days' worth of Reddit comments.

We achieve about 80% to 89% precision (0.88 in the run above). This looks very good, but our evaluation sentences are short, and precision does not measure fluent mimicry, only whether the generated tokens are real words from the test vocabulary, and most of them are. Training for more epochs and tuning the RNN's hyperparameters should produce better results.

# Response to 1. Baseline Model Performance Metrics (Part 2)
The second metric we've chosen is perplexity. Lower perplexity means the model assigns higher probability to the text it is asked to predict (or, for our purposes, generate), so we want it to be low. Conveniently, TensorFlow / Keras returns a history object from `fit` that records the loss after each epoch. Our loss function was `"sparse_categorical_crossentropy"`, which is the mean negative log-likelihood per character in nats, so `e^loss` gives the per-character perplexity. Because the history records the training loss, this perplexity describes the training data.
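
The relationship between cross-entropy and perplexity can be checked on toy numbers. The sketch below assumes we are given the probability the model assigned to each true next character; the function names are hypothetical and the inputs are made up for illustration.

```python
import math


def cross_entropy(true_token_probs):
    """Mean negative log-likelihood (in nats) of the probabilities given to the true tokens."""
    return -sum(math.log(p) for p in true_token_probs) / len(true_token_probs)


def perplexity(true_token_probs):
    """Perplexity is the exponential of the mean cross-entropy."""
    return math.exp(cross_entropy(true_token_probs))


# A model that guesses uniformly among 81 characters assigns 1/81 to every true character,
# so its perplexity equals the number of characters.
uniform_guess = [1 / 81] * 10
print(perplexity(uniform_guess))

# A model that is always certain and always right has perplexity 1.
print(perplexity([1.0, 1.0, 1.0]))
```

Perplexity can be read as the effective number of equally likely choices the model is hesitating between at each step.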

In [249]:
latest_loss = history.history['loss'][-1]  # training loss after the final epoch
perplexity = np.exp(latest_loss)

In [250]:
print("Perplexity for our model: ", perplexity)
print("Log of test vocab size: ", np.log(len(test_vocab)))

Perplexity for our model:  11.679503125466205
Log of test vocab size:  9.980402296304254


### Discussion on Perplexity Metric

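Our model's perplexity is 11.68. Because it is computed from the final training loss, it is a per-character figure on the training data. The second number, 9.98, is the natural log of the test vocabulary size. That is the entropy (in nats) of a uniform distribution over the test words rather than a perplexity, and it is word-level while our model is character-level, so the two numbers are not directly comparable. A more meaningful reference point is uniform guessing over our 81 characters, which has a perplexity of 81; at 11.68 the model is far better than chance but still not fantastic. We want the perplexity to be lower, since that would mean a less "confused" model, one that is more "confident" in its predictions (generations).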

# Discussion on the Model

Recurrent Neural Networks are inherently suited for natural language generation due to their usage of temporal context, which is essential in an application where we are generating (more specifically, predicting) characters. Concretely, we can generate characters based on the previous characters.


Our baseline model is deliberately simple and leaves clear room for improvement.

```
Model: "sequential_3"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
gru_4 (GRU)                  (None, None, 128)         81024     
_________________________________________________________________
gru_5 (GRU)                  (None, None, 128)         99072     
_________________________________________________________________
time_distributed_3 (TimeDist (None, None, 81)          10449     
=================================================================
Total params: 190,545
Trainable params: 190,545
Non-trainable params: 0
_________________________________________________________________
```

The first two layers are GRU layers with 128 units each. Both use a dropout rate of 20%, which drops a fifth of each layer's inputs during training and discourages reliance on any single unit. The output layer is a time-distributed dense layer with 81 output neurons, one for each character in our vocabulary, so at every time step it produces one score per possible next character.
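
The one-to-one correspondence between output neurons and characters can be sketched with plain dictionaries. The tiny vocabulary below is hypothetical; in the notebook the Keras tokenizer assigns ids starting at 1, and the code subtracts 1 so that ids run from 0 to `max_id - 1` and line up with the output neurons.

```python
chars = ["<", ">", " ", "e", "t"]  # hypothetical tiny character vocabulary
char_to_id = {ch: i for i, ch in enumerate(chars)}
id_to_char = {i: ch for ch, i in char_to_id.items()}


def one_hot(char):
    """Input encoding: a vector with a single 1 at the character's id."""
    vec = [0.0] * len(chars)
    vec[char_to_id[char]] = 1.0
    return vec


def decode(output_scores):
    """Output decoding: the neuron with the highest score names the predicted character."""
    best = max(range(len(output_scores)), key=lambda i: output_scores[i])
    return id_to_char[best]


toy_scores = [0.1, 0.0, 0.2, 0.6, 0.1]  # hypothetical scores for one time step
print(decode(toy_scores))
```

Because the mapping is one-to-one in both directions, the same table is used to encode input characters and to turn the network's output back into text.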