In [1]:
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.tokenize import WhitespaceTokenizer
from nltk import FreqDist
import numpy as np
import tensorflow as tf


## Import text data

The data is currently one text file, with each line corresponding to one post. The method for extraction from Reddit is detailed in the scraper file.

The data will first be explored.

In [2]:
file_path = '../data/relationships_10000.txt'

with open(file_path, 'r') as file:
    raw_relationship_data = file.read()
    print("file imported")

file imported


In [3]:
raw_relationship_data[:100]

'I Had A Dream About A Past Flame And Woke Up Missing Them - Am I Crazy?\nSister [11f] sleeps beside m'

This first post raises an interesting point about capital letters. I had assumed that we wouldn't need to lowercase the data, but if this Capitalise Every Word style is prevalent then it could be an issue. We will assume that lowercasing produces a more informative model, because the text is more uniform and out-of-vocabulary words are less likely.

We are going to explore the punctuation in the text as a whole to see what may be insignificant.

In [4]:
print("Count of punctuations")
print(r"\n  ", raw_relationship_data.count("\n"))
print(r".  ", raw_relationship_data.count("."))
print(r",  ", raw_relationship_data.count(","))
print(r":  ", raw_relationship_data.count(":"))
print(r";  ", raw_relationship_data.count(";"))
print(r"\t  ", raw_relationship_data.count("\t"))
print(r"?  ", raw_relationship_data.count("?"))
print(r"!  ", raw_relationship_data.count("!"))
print(r"-  ", raw_relationship_data.count("-"))
print(r"(  ", raw_relationship_data.count("("))
print(r")  ", raw_relationship_data.count(")"))
print(r":(  ", raw_relationship_data.count(":("))
print(r":)  ", raw_relationship_data.count(":)"))
print(r"</3  ", raw_relationship_data.count("</3"))
print(r"[  ", raw_relationship_data.count("["))
print(r"]  ", raw_relationship_data.count("]"))
print(r"'  ", raw_relationship_data.count("'"))
print(r'"', raw_relationship_data.count('"'))
print(r"<  ", raw_relationship_data.count("<"))
print(r"_  ", raw_relationship_data.count("_"))


Count of punctuations
\n   10000
.   4417
,   1809
:   165
;   33
\t   0
?   3672
!   328
-   503
(   7018
)   6990
:(   27
:)   1
</3   1
[   2964
]   2965
'   2309
" 291
<   3
_   31


This shows that we have exactly the expected number of `\n` symbols: 10000, one per post. Much of the other punctuation may not be relevant: symbols such as `:`, `;`, `!`, `-` and `_` are rare compared with full stops, question marks and (maybe?) commas.
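The per-symbol `count` calls above can also be collapsed into a single pass. The following is a minimal sketch, assuming only single-character symbols matter; `punctuation_counts` and the sample string are hypothetical names introduced for illustration, and multi-character symbols such as `:(` would still need `str.count`.

```python
from collections import Counter
import string


def punctuation_counts(text, extra="\n\t"):
    """Count every ASCII punctuation character (plus newline and tab) in one pass."""
    wanted = set(string.punctuation) | set(extra)
    return Counter(ch for ch in text if ch in wanted)


# Toy input standing in for raw_relationship_data
sample = "Am I crazy?\nSister [11f] sleeps... ok?\n"
counts = punctuation_counts(sample)

# Print the symbols from most to least frequent
for symbol, n in counts.most_common():
    print(repr(symbol), n)
```

Sorting by frequency this way makes it easier to spot which symbols are common enough to be worth keeping as tokens.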

## Clean data

We want the data to take into account certain grammatical and punctuation syntax. Therefore we are going to map certain symbols to others and mark where each post ends. There must be spaces around the relevant tokens or they won't be parsed properly.

The punctuation that is going to be kept in is:

* full stops
* question marks
* brackets (one type)

We are going to convert the text to lower case for all words in order to increase the uniformity of the text.

The newline `\n` symbol is going to be converted to ` <END> <START> ` to mark the end of one post and the start of the next (using the assumption that there is one post per line).

Should probably be using regular expressions here for better performance but alas this is a first run.
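For reference, a regular-expression version of the cleaning steps might look roughly like the sketch below. It is not the code used in this notebook: `clean_posts` is a hypothetical helper, and mapping both ellipsis forms to a comma before the full stops are padded is an assumption about the intended behaviour.

```python
import re


def clean_posts(text):
    """Lowercase, pad kept punctuation with spaces and drop the rest."""
    text = text.lower()
    # Handle ellipses before single full stops get padded (assumed mapping to a comma)
    text = re.sub(r"\.{3,}|…", " , ", text)
    # Remove stray angle brackets so they cannot clash with the markers
    text = re.sub(r"[<>]", " ", text)
    # One post per line: mark the end of one post and the start of the next
    text = text.replace("\n", " <END> <START> ")
    # Keep full stops, question marks and commas as separate tokens
    text = re.sub(r"([.?,])", r" \1 ", text)
    # Use a single bracket type
    text = text.replace("[", " (").replace("]", ") ")
    # Replace unwanted separators with a space, delete quote characters
    text = re.sub(r"[:;\-!_]", " ", text)
    text = re.sub("[\"'“”’]", "", text)
    # Collapse runs of spaces
    return re.sub(r" {2,}", " ", text).strip()


cleaned = clean_posts("I Had A Dream - Am I Crazy?\nSister [11f] sleeps")
print(cleaned)
```

Collapsing runs of spaces with a single `re.sub` also avoids the leftover double spaces that one pass of `replace("  ", " ")` can leave behind.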

### Lowercase the data


In [5]:
raw_relationship_data = raw_relationship_data.lower()
print(raw_relationship_data[:100])

i had a dream about a past flame and woke up missing them - am i crazy?
sister [11f] sleeps beside m


### Add spaces to the punctuation we want to keep


In [6]:
raw_relationship_data = raw_relationship_data.replace("<", " ")
raw_relationship_data = raw_relationship_data.replace(">", " ")


raw_relationship_data = raw_relationship_data.replace("\n", " <END> <START> ")
raw_relationship_data = raw_relationship_data.replace(".", " . ")
raw_relationship_data = raw_relationship_data.replace("?", " ? ")
raw_relationship_data = raw_relationship_data.replace(",", " , ")

raw_relationship_data = raw_relationship_data.replace("[", " (")
raw_relationship_data = raw_relationship_data.replace("]", ") ")

raw_relationship_data = raw_relationship_data.replace(":", " ")
raw_relationship_data = raw_relationship_data.replace(";", " ")
raw_relationship_data = raw_relationship_data.replace("-", " ")
raw_relationship_data = raw_relationship_data.replace("!", " ")
raw_relationship_data = raw_relationship_data.replace("_", " ")

raw_relationship_data = raw_relationship_data.replace('"', "")
raw_relationship_data = raw_relationship_data.replace("'", "")
raw_relationship_data = raw_relationship_data.replace("“", "")
raw_relationship_data = raw_relationship_data.replace('”', "")
raw_relationship_data = raw_relationship_data.replace('’', "")
raw_relationship_data = raw_relationship_data.replace('…', " ")
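# NOTE: this line is a no-op, because every "." was already padded to " . " above,
# so "..." can no longer occur; it would need to run before the "." replacement
raw_relationship_data = raw_relationship_data.replace('...', " , ")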
#raw_relationship_data = raw_relationship_data.replace('/', " ")





I gave up on avoiding regular expressions. We can use one to check which non-alphanumeric characters are still in the text.

In [7]:
import re
set(re.sub(r'[A-Za-z0-9 ]', '', raw_relationship_data))

{'#',
 '$',
 '%',
 '&',
 '(',
 ')',
 '*',
 '+',
 ',',
 '.',
 '/',
 '<',
 '=',
 '>',
 '?',
 '@',
 '\\',
 '^',
 '{',
 '|',
 '}',
 '~',
 '¿',
 'á',
 'ã',
 'ç',
 'é',
 'ê',
 'ô',
 'ü',
 'ă',
 'ı',
 'ť',
 'а',
 'в',
 'е',
 'ж',
 'и',
 'к',
 'л',
 'м',
 'о',
 'р',
 'с',
 'т',
 'х',
 'ч',
 'ы',
 'ь',
 'ấ',
 'ẻ',
 'ế',
 'ố',
 'ử',
 '\u200d',
 '–',
 '—',
 '‘',
 '„',
 '€',
 '☺',
 '♀',
 '♂',
 '♡',
 '♥',
 '️',
 '𝐆',
 '𝐋',
 '𝐑',
 '𝐒',
 '𝐓',
 '𝐚',
 '𝐞',
 '𝐟',
 '𝐠',
 '𝐡',
 '𝐢',
 '𝐤',
 '𝐥',
 '𝐦',
 '𝐧',
 '𝐨',
 '𝐩',
 '𝐬',
 '𝐭',
 '𝟗',
 '🎹',
 '🏻',
 '🏼',
 '🏽',
 '👏',
 '👧',
 '💔',
 '💕',
 '💝',
 '🔥',
 '😅',
 '😔',
 '😞',
 '😩',
 '😪',
 '😬',
 '😭',
 '😲',
 '🤔',
 '🤕',
 '🤦',
 '🤷',
 '🥵',
 '🥺'}

From this we can see that a wide range of characters is not covered by our replacement procedure. The aim is to replace everything with a space except:

* alphanumerics and spaces
* full stops, commas, question marks
* the `<` and `>` characters used in the `<END>` and `<START>` symbols

In [8]:
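# NOTE: the ^ is outside the brackets, so it anchors the match to the start of the string
# and only the very first character can be replaced; a negated class would be written
# r'[^A-Za-z0-9 <>,.?]'. The pattern is left as run so that it matches the outputs below.
relationship_data = re.sub(r'^[A-Za-z0-9 <>,.?]', ' ', raw_relationship_data)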
relationship_data = relationship_data.replace("  ", " ")
print(relationship_data[:200])

 had a dream about a past flame and woke up missing them  am i crazy ? <END> <START> sister (11f) sleeps beside my (26m) used t shirts because it helps her sleep while im not at home . i find it uncom


In [9]:
len("  had a dream about a past flame and woke up missing them   am i crazy ?  <END>")

79

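The first post has gone wrong: its leading "i" has been replaced by a space. The cause is the regular expression above, where the `^` sits outside the square brackets and so anchors the match to the start of the string instead of negating the character class. Only the very first character is ever replaced, which is why the rest of the posts are unaffected. We will just strip the first post from the front. A side effect is that parentheses (and any other leftover symbols) stay in the text, which suits us as we want the (m23) type syntax; hopefully this will not impact performance significantly.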

In [10]:
relationship_data = relationship_data[77:]
print(relationship_data[:30])
#relationship_data = relationship_data + " <END>"

<START> sister (11f) sleeps be


It may be useful to replace the (GENDER_AGE) syntax with a generic placeholder to prevent rare / out-of-vocab issues; otherwise the model will end up predicting some age based on language.
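A possible way to do that replacement is sketched below. This step is not applied anywhere in this notebook; the pattern, the `mask_age_gender` helper and the `<AGE_GENDER>` placeholder are all hypothetical, and the pattern only covers the variants seen in the samples, such as `(34f)`, `(m19)` and `(23/f)`.

```python
import re

# One or two digits and an m/f marker, in either order, optionally separated by "/"
AGE_GENDER = re.compile(r"\((?:\d{1,2}\s*/?\s*[mf]|[mf]\s*/?\s*\d{1,2})\)")


def mask_age_gender(text, placeholder="<AGE_GENDER>"):
    """Replace every (age, gender) tag with one generic placeholder token."""
    return AGE_GENDER.sub(placeholder, text)


masked = mask_age_gender("my (34f) (ex)boyfriend (m19) and his sister (23/f)")
print(masked)
```

The trade-off is that the model can no longer generate concrete ages, so whether this helps depends on how much those tags matter in the generated titles.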

Not quite sure where to tokenise this data, definitely before creating the sequences but not sure if the data should be sentences first.

Will go with before creating sentences.


### Tokenization

Separate the string into words using spaces to determine a new token. This will make punctuation tokens which is what we want for sentence structure.

We could use one of NLTK's casual tokenizers, but as we have already preprocessed the strings for our own purpose the standard one may do fine. EDIT: as we have processed our words and punctuation to have whitespace where appropriate, the WhitespaceTokenizer is best here.

In [11]:
ws_tk = WhitespaceTokenizer() 

relationships_word_tokened = ws_tk.tokenize(relationship_data)

print(relationships_word_tokened[:50])

['<START>', 'sister', '(11f)', 'sleeps', 'beside', 'my', '(26m)', 'used', 't', 'shirts', 'because', 'it', 'helps', 'her', 'sleep', 'while', 'im', 'not', 'at', 'home', '.', 'i', 'find', 'it', 'uncomfortable', 'but', 'also', 'im', 'not', 'sure', 'what', 'to', 'think', '.', 'is', 'this', 'normal/ok', '?', '<END>', '<START>', 'equality', 'in', 'relationship', '<END>', '<START>', 'r/relationship', 'i', 'need', 'your', 'perspective']


# all_word_dist = FreqDist(word for word in relationships_word_tokened)
# print(all_word_dist.most_common(50))

Unsurprisingly many of our most common words are stop words, but these are important to our sentence structure so they will be kept in. 
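The frequency check referred to above can also be done with the standard library instead of `FreqDist`. A minimal sketch on a toy token list; `most_common_tokens` and the choice to skip the `<START>`/`<END>` markers are assumptions made for illustration:

```python
from collections import Counter


def most_common_tokens(tokens, n=50, exclude=("<START>", "<END>")):
    """Return the n most frequent tokens, ignoring the post boundary markers."""
    counts = Counter(t for t in tokens if t not in exclude)
    return counts.most_common(n)


# Toy stand-in for relationships_word_tokened
toy_tokens = ["<START>", "i", "need", "help", ".", "i", "need", "you", "<END>"]
top = most_common_tokens(toy_tokens, n=2)
print(top)
```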

We may choose to use the sentence structure of our data instead of a bag-of-words model, which means tokenising the posts as well as the words. I've done this kind of backwards: the `\n` strings denoted new posts previously, but now we split on the markers to get one cleaned string per post.

In [12]:
relationship_data_sents = relationship_data.split(" <END> <START> ")
relationship_data_sents[0] = relationship_data_sents[0].replace("<START>", "")
relationship_data_sents = [x for x in relationship_data_sents if x]


print(relationship_data_sents[:10])


[' sister (11f) sleeps beside my (26m) used t shirts because it helps her sleep while im not at home . i find it uncomfortable but also im not sure what to think . is this normal/ok ?', 'equality in relationship', 'r/relationship i need your perspective and help', 'my (34f) (ex)boyfriend (40m) cheated on me last night am i making the right decision ?', 'i (24m) react too intensely when my husband (23m) has a problem  how do i calm down ?', 'r/relationships i need your perspective', 'should i (24f) remain friends with my ex boyfriend (32m) ?', 'am i (m23) getting overly attached too quickly ?', 'how do i (24m) stop reacting so intensely ?', 'i (30f) have a weird (abusive ? ) relationship with my boss (36f) and may need to quit abruptly . no idea what to do']


In [13]:
relationship_data_sents_words = [ws_tk.tokenize(post) for post in relationship_data_sents]

MAX_SEQ_LENGTH = max([len(post) for post in relationship_data_sents_words])

relationship_data_sents_words = [x for x in relationship_data_sents_words if x]

MIN_SEQ_LENGTH = min([len(post) for post in relationship_data_sents_words])


print("Max post length: ", MAX_SEQ_LENGTH, "\n\n")
print("Min post length: ", MIN_SEQ_LENGTH, "\n\n")


print(relationship_data_sents_words[3])

Max post length:  73 


Min post length:  1 


['my', '(34f)', '(ex)boyfriend', '(40m)', 'cheated', 'on', 'me', 'last', 'night', 'am', 'i', 'making', 'the', 'right', 'decision', '?']


We now have a list of posts, each of which is a list of the tokens in that post. The length of the longest post is given by `MAX_SEQ_LENGTH`.

### Generate vocab

In [14]:
import functools
import operator

flattened_word_tokened = functools.reduce(operator.concat, relationship_data_sents_words)

# this has to be done after tokenisation or it will count strings
vocab = sorted(set(flattened_word_tokened))
# +1 because Keras word indices start at 1; index 0 is reserved for padding
len_vocab = len(vocab) + 1
print("Vocab length: ", len_vocab)

Vocab length:  7434


We need to convert the word data into integers the model will be able to understand. It's a little bit of cheating, but Keras has a nice way to do this.

In [15]:
from keras.preprocessing.text import Tokenizer

keras_embedder = Tokenizer(num_words=None, filters=[], lower=False, split=" ")

keras_embedder.fit_on_texts(relationship_data_sents)

embedded_sents = keras_embedder.texts_to_sequences(relationship_data_sents)

print(len(embedded_sents))

Using TensorFlow backend.


9999
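To make the integer mapping concrete, here is a rough pure-Python sketch of the idea behind the Keras `Tokenizer` call above: tokens are ranked by frequency and numbered from 1, leaving 0 free for padding. The helper names are hypothetical and this is only an approximation of what Keras does internally.

```python
from collections import Counter


def build_word_index(posts):
    """Rank tokens by frequency; indices start at 1 so that 0 stays free for padding."""
    counts = Counter(token for post in posts for token in post.split(" ") if token)
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}


def encode(post, word_index):
    """Map tokens to integers, skipping tokens that are not in the index."""
    return [word_index[t] for t in post.split(" ") if t in word_index]


toy_posts = ["i need your perspective", "i need help ?"]
toy_index = build_word_index(toy_posts)
toy_encoded = [encode(post, toy_index) for post in toy_posts]
print(toy_index)
print(toy_encoded)
```

Note that these integers are only vocabulary indices; the dense word vectors are learned later by the `Embedding` layer.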


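We now have an integer encoding for each post, so we can make the train/predict sequence pairs. Each sequence is a sliding window of three consecutive tokens: the first two are the input and the third is the word to predict. Posts with three tokens or fewer are dropped.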

In [16]:
from keras.preprocessing.sequence import pad_sequences

embedded_sent_ = []
for post in embedded_sents:
    if len(post) > 3:
        embedded_sent_.append(post)

sequences = []
for post in embedded_sent_:
    for index in range(2, len(post)):
        single_sequence = post[index-2:index+1]
        sequences.append(single_sequence)

print("Total Sequences: {}".format(len(sequences)))


Total Sequences: 124846


We want to ensure all our sequences have the same length. They already should, since every window has exactly three tokens, so the padding here is effectively a sanity check.

In [17]:
padded_sequences = pad_sequences(sequences, maxlen=3, padding='pre')
        
print(padded_sequences.shape)

(124846, 3)


Convert the sequences to features + targets in order to train a model in a categorical manner.

In [18]:
# split into input and output elements
# pad_sequences returns a 2-D array, so plain slicing is enough:
# the first two columns are the context, the last column is the target word
padded_sequences = np.asarray(padded_sequences)
X = padded_sequences[:, :-1]
y = padded_sequences[:, -1]

print(X.shape)
print(y.shape)
print(y[:20])



(124846, 2)
(124846,)
[4283 4284    2  147  356 2378 4285   75   21 3014   24  447  194   23
   38   64  303    3    1  300]


In [19]:
from keras.models import Sequential
from keras.layers import Dropout
from keras.layers import Dense
from keras.layers import LSTM
from keras.layers import Embedding

In [20]:
number_of_embeddings = 50
LSTM_units = 256

# define model
model = Sequential()
model.add(Embedding(len_vocab, number_of_embeddings, input_length=2))
model.add(LSTM(LSTM_units))
model.add(Dense(len_vocab, activation='softmax'))
print(model.summary())




Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 2, 50)             371700    
_________________________________________________________________
lstm_1 (LSTM)                (None, 256)               314368    
_________________________________________________________________
dense_1 (Dense)              (None, 7434)              1910538   
Total params: 2,596,606
Trainable params: 2,596,606
Non-trainable params: 0
_________________________________________________________________
None
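As a sanity check on the summary above, the parameter counts can be reproduced by hand. The sketch uses the standard formulas for an embedding table, an LSTM layer (four gates, each with input weights, recurrent weights and a bias) and a dense layer; the variable names are only illustrative.

```python
vocab_size = 7434     # len_vocab in the notebook
embedding_dim = 50    # number_of_embeddings
lstm_units = 256      # LSTM_units

# One embedding_dim-sized vector per vocabulary index
embedding_params = vocab_size * embedding_dim

# Four gates, each with input weights, recurrent weights and a bias vector
lstm_params = 4 * ((embedding_dim + lstm_units) * lstm_units + lstm_units)

# One weight per (LSTM unit, vocabulary index) pair plus one bias per index
dense_params = lstm_units * vocab_size + vocab_size

total_params = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total_params)
# → 371700 314368 1910538 2596606
```

Almost three quarters of the parameters sit in the final softmax layer, so the vocabulary size is the main driver of model size here.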


In [21]:
# compile network
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])





In [22]:
# fit network
num_epochs = 10

model.fit(X, y, epochs=num_epochs, verbose=1)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x10f3835c0>

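The function below generates text from a seed: it encodes the current text as integers, predicts the most likely next word from the last two tokens, appends it, and repeats for a given number of words.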

In [23]:
# generate a sequence from a language model
def generate_seq(model, tokenizer, max_length, seed_text, n_words):
    in_text = seed_text
    # generate a fixed number of words
    for _ in range(n_words):
        # encode the text as integers (words not in the vocabulary are dropped)
        encoded = tokenizer.texts_to_sequences([in_text])[0]
        # pre-pad / truncate to the fixed context length, keeping the last tokens
        encoded = pad_sequences([encoded], maxlen=max_length, padding='pre')
        # greedy decoding: take the index of the most probable next word
        # (same result as the older model.predict_classes, which newer versions removed)
        yhat = int(np.argmax(model.predict(encoded, verbose=0), axis=-1)[0])
        # map predicted word index to word ('' for the padding index 0)
        out_word = tokenizer.index_word.get(yhat, '')
        # append to input
        in_text += ' ' + out_word
    return in_text

In [24]:
max_length = 3

test = "there is"

generate_seq(model, keras_embedder, max_length-1, test, 10)

'there is something wrong with me ? ? ? ? ? ?'

In [26]:
seeds = [
    "My wife",
    "My husband",
    "My friend",
    "My fiance",
    "My (22M)",
    "My girlfriend",
    "My boyfriend",
    "My partner",
    "My (23F)",
    "My spouse",
]

length = 25

# generate one continuation per seed
for seed in seeds:
    print(generate_seq(model, keras_embedder, max_length-1, seed, length), "\n")


My wife (34f) and i dont know what to do . . . . . . . . . . . . . . . . . 

My husband (30m) is having a hard time moving forward in the wrong ? chasing empathy . . . . . . . . . . . 

My friend (m19) has a girlfriend . . . . . . . . . . . . . . . . . . . . . 

My fiance (30m) of 7 months , but i have a lot of suspicion about the future ? 💔 and i dont know what to do . 

My (22M) over possibly being evaluated for autism and i dont know what to do . . . . . . . . . . . . 

My girlfriend (23/f) of 1 year , and i dont know what to do . . . . . . . . . . . . . 

My boyfriend (m19) doesnt like me ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? 

My partner of 1 year , and i dont know what to do . . . . . . . . . . . . . . 

My (23F) over possibly being evaluated for autism and i dont know what to do . . . . . . . . . . . . 

My spouse (30m) is having a hard time moving forward in the wrong ? chasing empathy . . . . . . . . . . . 
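One thing worth checking in these outputs: the tokenizer was fitted on lowercased text with `lower=False`, so capitalised seed tokens such as `My` or `(22M)` are not in its vocabulary, and `texts_to_sequences` skips unknown words when no `oov_token` is set. That would be consistent with the `My (22M)` and `My (23F)` seeds producing identical continuations, although this has not been verified against the trained model. A minimal sketch with a hypothetical toy index (the real one lives in `keras_embedder.word_index`):

```python
# Hypothetical toy vocabulary standing in for keras_embedder.word_index
toy_word_index = {"my": 1, "wife": 2, "(22m)": 3}


def encode_seed(seed, word_index):
    """Mimic the skipping of out-of-vocabulary tokens when a seed is encoded."""
    return [word_index[t] for t in seed.split() if t in word_index]


as_typed = encode_seed("My (22M)", toy_word_index)
lowercased = encode_seed("My (22M)".lower(), toy_word_index)

print(as_typed)     # nothing survives, so the model would only see padding
print(lowercased)   # both tokens are found once the seed is lowercased
```

Lowercasing the seed text before calling `generate_seq` would be one way to make sure the model actually conditions on it.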

