In [358]:
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.tokenize import WhitespaceTokenizer
from nltk import FreqDist

## Import text data

The data is currently one text file, with each line corresponding to one post. The method for extraction from Reddit is detailed in the scraper file.

The data will first be explored.

In [359]:
file_path = '../data/relationships_10000.txt'

with open(file_path, 'r', encoding='utf-8') as file:
    raw_relationship_data = file.read()
    print("file imported")

file imported


In [360]:
raw_relationship_data[:100]

'I Had A Dream About A Past Flame And Woke Up Missing Them - Am I Crazy?\nSister [11f] sleeps beside m'

This first post raises an interesting point about capital letters. I assumed that we wouldn't need to lowercase all the data, but if this Capitalise Every Word style is prevalent then it could be an issue. We will assume that lowercasing will produce a more informative model due to uniformity and a lower likelihood of out-of-vocabulary words.
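
As a rough illustration of that assumption, the sketch below compares naive vocabulary sizes before and after lowercasing. The sample text is invented for the example and is not taken from the dataset.

```python
# Toy sample (invented for illustration): the same words in two casings.
sample = "Am I Crazy?\nam i crazy?"

# Naive whitespace vocabulary before and after lowercasing.
vocab_cased = set(sample.split())
vocab_lower = set(sample.lower().split())

print(len(vocab_cased), len(vocab_lower))
```

On this sample the lowercased vocabulary is half the size of the cased one, which is the kind of uniformity we are hoping for on the real data.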

We are going to explore the punctuation in the text as a whole to see what may be insignificant.

In [361]:
print("Count of punctuations")
print(r"\n  ", raw_relationship_data.count("\n"))
print(r".  ", raw_relationship_data.count("."))
print(r",  ", raw_relationship_data.count(","))
print(r":  ", raw_relationship_data.count(":"))
print(r";  ", raw_relationship_data.count(";"))
print(r"\t  ", raw_relationship_data.count("\t"))
print(r"?  ", raw_relationship_data.count("?"))
print(r"!  ", raw_relationship_data.count("!"))
print(r"-  ", raw_relationship_data.count("-"))
print(r"(  ", raw_relationship_data.count("("))
print(r")  ", raw_relationship_data.count(")"))
print(r":(  ", raw_relationship_data.count(":("))
print(r":)  ", raw_relationship_data.count(":)"))
print(r"</3  ", raw_relationship_data.count("</3"))
print(r"[  ", raw_relationship_data.count("["))
print(r"]  ", raw_relationship_data.count("]"))
print(r"'  ", raw_relationship_data.count("'"))
print(r'"', raw_relationship_data.count('"'))
print(r"<  ", raw_relationship_data.count("<"))
print(r"_  ", raw_relationship_data.count("_"))


Count of punctuations
\n   10000
.   4417
,   1809
:   165
;   33
\t   0
?   3672
!   328
-   503
(   7018
)   6990
:(   27
:)   1
</3   1
[   2964
]   2965
'   2309
" 291
<   3
_   31


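This shows that we have exactly the right number of `\n` symbols: 10,000, one per post. Full stops, question marks and (maybe?) commas are the most useful candidates to keep. Parentheses and square brackets are also frequent, presumably because of the age/gender tags such as `[11f]`, while most of the other punctuation appears too rarely to be relevant.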
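
A more compact way to get this kind of overview is to count every non-alphanumeric character in one pass. This is only a sketch: `count_symbols` is a hypothetical helper, and the sample string is invented (in the notebook it would be `raw_relationship_data`).

```python
from collections import Counter

def count_symbols(text):
    """Count every character that is not an ASCII letter, digit or space."""
    return Counter(
        ch for ch in text
        if not (ch.isascii() and ch.isalnum()) and ch != " "
    )

# Toy input (invented for illustration).
sample = "Am I crazy?\nSister [11f] sleeps beside my (26m) shirts.\n"
symbol_counts = count_symbols(sample)
for symbol, count in symbol_counts.most_common():
    print(repr(symbol), count)
```

This only counts single characters, so multi-character emoticons such as `:(` would still need `str.count` as above.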

## Clean data

We want the data to take into account certain grammatical and punctuation syntax. Therefore we are going to map certain symbols to others, and indicate where the end of a sentence is. We must ensure there are spaces around the relevant tokens or they won't be parsed properly.

The punctuation that is going to be kept in is:

* full stops
* question marks
* commas
* brackets (one type)

We are going to convert the text to lower case for all words in order to increase the uniformity of the text.

The newline `\n` symbol is going to be converted to ` <END> <START> ` to indicate the end of one post and the start of the next (using the assumption that posts are one line per post).

Should probably be using regular expressions here for better performance but alas this is a first run.
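
For comparison, a regular-expression version of the same idea might look like the sketch below. `regex_clean` is a hypothetical helper that covers only a subset of the rules (lowercasing, post markers, padding of `.`, `?` and `,`), and the sample input is invented.

```python
import re

def regex_clean(text):
    """Regex sketch of the clean-up; the exact rules are assumptions."""
    text = text.lower()
    # Drop stray angle brackets before introducing the <END>/<START> markers.
    text = re.sub(r"[<>]", " ", text)
    # Mark post boundaries.
    text = text.replace("\n", " <END> <START> ")
    # Pad the punctuation we keep so it becomes its own token.
    text = re.sub(r"([.?,])", r" \1 ", text)
    # Collapse runs of spaces left by the padding.
    return re.sub(r" {2,}", " ", text).strip()

print(regex_clean("Am I Crazy?\nNo idea what to do.\n"))
```

The lowercasing has to happen before the markers are added, otherwise `<END>` and `<START>` would be lowercased too.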

### Lowercase the data


In [362]:
raw_relationship_data = raw_relationship_data.lower()
print(raw_relationship_data[:100])

i had a dream about a past flame and woke up missing them - am i crazy?
sister [11f] sleeps beside m


### Add spaces to the punctuation we want to keep


In [363]:
# remove stray angle brackets before adding the <END>/<START> markers
raw_relationship_data = raw_relationship_data.replace("<", " ")
raw_relationship_data = raw_relationship_data.replace(">", " ")


# mark post boundaries and pad the punctuation we keep with spaces
raw_relationship_data = raw_relationship_data.replace("\n", " <END> <START> ")
raw_relationship_data = raw_relationship_data.replace(".", " . ")
raw_relationship_data = raw_relationship_data.replace("?", " ? ")
raw_relationship_data = raw_relationship_data.replace(",", " , ")

# use a single bracket type
raw_relationship_data = raw_relationship_data.replace("[", " (")
raw_relationship_data = raw_relationship_data.replace("]", ") ")

# replace the punctuation we don't keep with a space
raw_relationship_data = raw_relationship_data.replace(":", " ")
raw_relationship_data = raw_relationship_data.replace(";", " ")
#raw_relationship_data = raw_relationship_data.replace("-", " ")
raw_relationship_data = raw_relationship_data.replace("!", " ")
raw_relationship_data = raw_relationship_data.replace("_", " ")

# remove straight and curly quotes and apostrophes
raw_relationship_data = raw_relationship_data.replace('"', "")
raw_relationship_data = raw_relationship_data.replace("'", "")
raw_relationship_data = raw_relationship_data.replace("“", "")
raw_relationship_data = raw_relationship_data.replace('”', "")
raw_relationship_data = raw_relationship_data.replace('’', "")
raw_relationship_data = raw_relationship_data.replace('…', " ")
# note: "." was already padded to " . " above, so "..." no longer occurs and this line has no effect
raw_relationship_data = raw_relationship_data.replace('...', " , ")
#raw_relationship_data = raw_relationship_data.replace('/', " ")





I gave up on not using regular expressions. We can check which non-alphanumeric characters are still in the text.

In [364]:
import re
set(re.sub(r'[A-Za-z0-9 ]', '', raw_relationship_data))

{'#',
 '$',
 '%',
 '&',
 '(',
 ')',
 '*',
 '+',
 ',',
 '-',
 '.',
 '/',
 '<',
 '=',
 '>',
 '?',
 '@',
 '\\',
 '^',
 '{',
 '|',
 '}',
 '~',
 '¿',
 'á',
 'ã',
 'ç',
 'é',
 'ê',
 'ô',
 'ü',
 'ă',
 'ı',
 'ť',
 'а',
 'в',
 'е',
 'ж',
 'и',
 'к',
 'л',
 'м',
 'о',
 'р',
 'с',
 'т',
 'х',
 'ч',
 'ы',
 'ь',
 'ấ',
 'ẻ',
 'ế',
 'ố',
 'ử',
 '\u200d',
 '–',
 '—',
 '‘',
 '„',
 '€',
 '☺',
 '♀',
 '♂',
 '♡',
 '♥',
 '\ufe0f',
 '𝐆',
 '𝐋',
 '𝐑',
 '𝐒',
 '𝐓',
 '𝐚',
 '𝐞',
 '𝐟',
 '𝐠',
 '𝐡',
 '𝐢',
 '𝐤',
 '𝐥',
 '𝐦',
 '𝐧',
 '𝐨',
 '𝐩',
 '𝐬',
 '𝐭',
 '𝟗',
 '🎹',
 '🏻',
 '🏼',
 '🏽',
 '👏',
 '👧',
 '💔',
 '💕',
 '💝',
 '🔥',
 '😅',
 '😔',
 '😞',
 '😩',
 '😪',
 '😬',
 '😭',
 '😲',
 '🤔',
 '🤕',
 '🤦',
 '🤷',
 '🥵',
 '🥺'}

From this we can see there is a wide range of characters that are not covered by our replacing procedure. We will remove everything except:

* alphanumerics
* full stops, commas, question marks
* characters in the `<END>` and `<START>` symbols

In [365]:
relationship_data = re.sub(r'^[A-Za-z0-9 <>,.?]', ' ', raw_relationship_data)
relationship_data = relationship_data.replace("  ", " ")
print(relationship_data[:200])

 had a dream about a past flame and woke up missing them - am i crazy ? <END> <START> sister (11f) sleeps beside my (26m) used t-shirts because it helps her sleep while im not at home . i find it unco


In [366]:
len("  had a dream about a past flame and woke up missing them   am i crazy ?  <END>")

79

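That first post has gone wrong: the leading "i" has been replaced with a space. The cause is the regular expression rather than the earlier replacing. With the caret outside the brackets, `^[...]` anchors the match to the start of the string instead of negating the character class, so only the very first character is replaced and the remaining symbols are left in the text. This is why the problem doesn't appear in the rest of the sentences. We will just strip the front, which drops the first post. We end up keeping some parentheses as we want the (m23) type syntax; hopefully this will not impact the performance significantly.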
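
For reference, a caret only negates a character class when it sits inside the brackets; outside them it anchors the match to the start of the string. The sketch below shows the difference on an invented string. The extra `()/-` in the negated class is an assumption, added here so that `(26m)`-style tags and hyphenated words would survive.

```python
import re

# Invented sample containing a heart symbol and a hash sign.
sample = "i (26m) feel lost \u2665 #help"

# Caret outside the brackets: anchored at the start of the string,
# so only the first character is replaced.
anchored = re.sub(r"^[A-Za-z0-9 <>,.?]", " ", sample)

# Caret inside the brackets: every character NOT in the class is replaced.
negated = re.sub(r"[^A-Za-z0-9 <>,.?()/-]", " ", sample)

print(repr(anchored))
print(repr(negated))
```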

In [367]:
relationship_data = relationship_data[77:]
print(relationship_data[:30])
#relationship_data = relationship_data + " <END>"

 <START> sister (11f) sleeps b


It may be useful to replace the (GENDER_AGE) syntax with a generic placeholder in order to prevent rare / out-of-vocab issues; the model will end up predicting some age based on language.
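
A minimal sketch of that replacement is below. The pattern assumes tags look like `(11f)`, `(26m)` or `(m23)`, and `<AGE_GENDER>` is a hypothetical placeholder token that is not used anywhere else in this notebook.

```python
import re

# Matches one- or two-digit ages with an m/f marker on either side,
# wrapped in parentheses, e.g. (11f), (26m) or (m23).
AGE_GENDER_TAG = re.compile(r"\((?:\d{1,2}[mf]|[mf]\d{1,2})\)")

def replace_age_gender(text, placeholder="<AGE_GENDER>"):
    """Swap every age/gender tag for a single generic token."""
    return AGE_GENDER_TAG.sub(placeholder, text)

print(replace_age_gender("my (34f) (ex)boyfriend (40m) cheated on me"))
```

Other parenthesised text such as `(ex)` is left alone, since it does not match the age/gender pattern.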

Not quite sure where to tokenise this data, definitely before creating the sequences but not sure if the data should be sentences first.

Will go with before creating sentences.


### Tokenization

Separate the string into words using spaces to determine a new token. This will make punctuation tokens which is what we want for sentence structure.

Could use NLTK's casual tokenizer, but as we have already preprocessed the strings for our own purpose the standard one may do fine. EDIT: as we have processed our words and punctuation to have whitespace where appropriate, the WhitespaceTokenizer is best here.
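
Because the cleaning step pads every kept symbol with spaces, splitting on whitespace is all that is needed. The sketch below uses plain `str.split()` as a standard-library stand-in. It should yield the same tokens as the `WhitespaceTokenizer` on text like this, though that equivalence is an assumption rather than something checked in this notebook.

```python
# Already-cleaned sample in the format produced by the steps above.
cleaned = " <START> is this normal/ok ? <END> <START> equality in relationship <END>"

# Splitting on runs of whitespace keeps punctuation and markers as tokens.
tokens = cleaned.split()
print(tokens)
```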

In [368]:
ws_tk = WhitespaceTokenizer() 

relationships_word_tokened = ws_tk.tokenize(relationship_data)

print(relationships_word_tokened[:50])

['<START>', 'sister', '(11f)', 'sleeps', 'beside', 'my', '(26m)', 'used', 't-shirts', 'because', 'it', 'helps', 'her', 'sleep', 'while', 'im', 'not', 'at', 'home', '.', 'i', 'find', 'it', 'uncomfortable', 'but', 'also', 'im', 'not', 'sure', 'what', 'to', 'think', '.', 'is', 'this', 'normal/ok', '?', '<END>', '<START>', 'equality', 'in', 'relationship', '<END>', '<START>', 'r/relationship', 'i', 'need', 'your', 'perspective', 'and']


In [369]:
# this has to be done after tokenisation, otherwise set() would collect characters rather than words
vocab = sorted(set(relationships_word_tokened))
len_vocab = len(vocab)
print("Vocab length: ", len_vocab)

Vocab length:  7590


In [370]:
all_word_dist = FreqDist(relationships_word_tokened)
print(all_word_dist.most_common(50))

[('<START>', 10000), ('<END>', 9999), ('i', 6467), ('my', 5667), ('.', 4417), ('to', 4075), ('?', 3671), ('a', 2993), ('and', 2855), ('me', 2447), ('with', 2378), (',', 1809), ('is', 1623), ('of', 1555), ('the', 1458), ('do', 1380), ('how', 1359), ('for', 1302), ('in', 1225), ('boyfriend', 1118), ('am', 1039), ('on', 1017), ('it', 983), ('relationship', 962), ('im', 910), ('her', 883), ('friend', 813), ('what', 801), ('have', 760), ('about', 756), ('that', 726), ('girlfriend', 722), ('up', 714), ('dont', 713), ('but', 660), ('he', 653), ('know', 642), ('should', 629), ('not', 615), ('ex', 604), ('out', 598), ('she', 575), ('this', 571), ('or', 551), ('you', 534), ('him', 527), ('like', 521), ('want', 507), ('feel', 495), ('be', 493)]


Unsurprisingly many of our most common words are stop words, but these are important to our sentence structure so they will be kept in. 

We may choose to use the sentence structure of our data instead of a bag-of-words model, which will mean tokenising the sentences as well as the words. I've done this somewhat backwards: the `\n` characters originally separated the posts, but they have since been replaced, so we now split on the ` <END> <START> ` markers to get one cleaned string per post.

In [378]:
relationship_data_sents = relationship_data.split(" <END> <START> ")
relationship_data_sents[0] = relationship_data_sents[0].replace("<START>", "")

print(relationship_data_sents[:10])



['  sister (11f) sleeps beside my (26m) used t-shirts because it helps her sleep while im not at home . i find it uncomfortable but also im not sure what to think . is this normal/ok ?', 'equality in relationship', 'r/relationship i need your perspective and help', 'my (34f) (ex)boyfriend (40m) cheated on me last night- am i making the right decision ?', 'i (24m) react too intensely when my husband (23m) has a problem - how do i calm down ?', 'r/relationships i need your perspective', 'should i (24f) remain friends with my ex boyfriend (32m) ?', 'am i (m23) getting overly attached too quickly ?', 'how do i (24m) stop reacting so intensely ?', 'i (30f) have a weird (abusive ? ) relationship with my boss (36f) and may need to quit abruptly . no idea what to do']


In [381]:
relationship_data_sents_words = [ws_tk.tokenize(post) for post in relationship_data_sents]

print("Max post length: ", max([len(post) for post in relationship_data_sents_words]), "\n\n")

print(relationship_data_sents_words[0])

Max post length:  73 


['sister', '(11f)', 'sleeps', 'beside', 'my', '(26m)', 'used', 't-shirts', 'because', 'it', 'helps', 'her', 'sleep', 'while', 'im', 'not', 'at', 'home', '.', 'i', 'find', 'it', 'uncomfortable', 'but', 'also', 'im', 'not', 'sure', 'what', 'to', 'think', '.', 'is', 'this', 'normal/ok', '?']
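
An alternative ordering that avoids splitting on the markers afterwards is to split the raw text into posts first and clean each post on its own. This is only a sketch: `clean_post` is a hypothetical helper applying a small subset of the rules above, and the sample text is adapted from the posts shown earlier.

```python
def clean_post(post):
    """Hypothetical per-post cleaner; only a subset of the rules above."""
    post = post.lower()
    # Pad the punctuation we keep so it becomes its own token.
    for symbol in (".", "?", ","):
        post = post.replace(symbol, f" {symbol} ")
    return post.split()

raw = "Equality in relationship\nAm I (m23) getting overly attached too quickly?\n"

# One token list per post; splitlines() does not produce a trailing empty post.
posts_tokens = [clean_post(line) for line in raw.splitlines()]
print(posts_tokens)
print("Max post length:", max(len(post) for post in posts_tokens))
```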


We now have a list containing each post, within each post is a list of each token within the post. The longest post is 