In [169]:
import nltk

## Import text data

The data is currently one text file, with each line corresponding to one post. The method for extraction from Reddit is detailed in the scraper file.

The data will first be explored.

In [170]:
file_path = '../data/relationships_10000.txt'

with open(file_path, 'r') as file:
    raw_relationship_data = file.read()
    print("file imported")

file imported


In [171]:
raw_relationship_data[:100]

'I Had A Dream About A Past Flame And Woke Up Missing Them - Am I Crazy?\nSister [11f] sleeps beside m'

This first post raises an interesting point about capital letters. I assumed that we wouldn't need to lowercase the data, but if this Capitalise Every Word style is prevalent then it could be an issue. We will assume that lowercasing produces a more informative model, because the text is more uniform and out-of-vocabulary words are less likely.

We are going to explore the punctuation in the text as a whole to see what may be insignificant.

In [172]:
print("Count of punctuations")
print(r"\n  ", raw_relationship_data.count("\n"))
print(r".  ", raw_relationship_data.count("."))
print(r",  ", raw_relationship_data.count(","))
print(r":  ", raw_relationship_data.count(":"))
print(r";  ", raw_relationship_data.count(";"))
print(r"\t  ", raw_relationship_data.count("\t"))
print(r"?  ", raw_relationship_data.count("?"))
print(r"!  ", raw_relationship_data.count("!"))
print(r"-  ", raw_relationship_data.count("-"))
print(r"(  ", raw_relationship_data.count("("))
print(r")  ", raw_relationship_data.count(")"))
print(r":(  ", raw_relationship_data.count(":("))
print(r":)  ", raw_relationship_data.count(":)"))
print(r"</3  ", raw_relationship_data.count("</3"))
print(r"[  ", raw_relationship_data.count("["))
print(r"]  ", raw_relationship_data.count("]"))
print(r"'  ", raw_relationship_data.count("'"))
print(r'"', raw_relationship_data.count('"'))
print(r"<  ", raw_relationship_data.count("<"))
print(r"_  ", raw_relationship_data.count("_"))


Count of punctuations
\n   10000
.   4417
,   1809
:   165
;   33
\t   0
?   3672
!   328
-   503
(   7018
)   6990
:(   27
:)   1
</3   1
[   2964
]   2965
'   2309
" 291
<   3
_   31


This shows that we have exactly the right number of `\n` symbols: 10,000, one per post. Much of the other punctuation may not be relevant, as most symbols are rare. Only full stops, question marks, (maybe) commas, apostrophes and the two bracket types occur in large numbers.
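
The hand-written list of `print` calls only covers the symbols we thought of in advance. As a rough sketch, a `collections.Counter` can tally every character that is not a letter, digit or space in one pass. The helper name `count_non_alphanumeric` and the sample string are made up for illustration; in the notebook the input would be `raw_relationship_data`.

```python
from collections import Counter

def count_non_alphanumeric(text):
    """Count every character that is neither a letter, a digit, nor a space."""
    return Counter(ch for ch in text if not (ch.isalnum() or ch == " "))

# Hypothetical sample standing in for raw_relationship_data
sample = "Am I crazy?\nSister [11f] sleeps... :("
punct_counts = count_non_alphanumeric(sample)

# Most frequent symbols first; repr() makes "\n" and "\t" visible
for char, count in punct_counts.most_common():
    print(repr(char), count)
```

This counts single characters only, so multi-character patterns such as `:(` or `</3` would still need `str.count` as in the cell above.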

## Clean data

We want the data to take into account certain grammatical and punctuation syntax. Therefore we are going to map certain symbols to others and indicate where sentences and posts end. There must be spaces around the relevant tokens, or they won't be parsed properly.

The punctuation that is going to be kept in is:

* full stops
* question marks
* commas
* brackets (one type)

We are going to convert the text to lower case for all words in order to increase the uniformity of the text.

The newline `\n` symbol is going to be converted to ` <END> <START> ` to mark the boundary between two posts (using the assumption that posts are one line per post).

Regular expressions would probably perform better here, but plain string replacement will do for a first run.

### Lowercase the data


In [173]:
raw_relationship_data = raw_relationship_data.lower()
print(raw_relationship_data[:100])

i had a dream about a past flame and woke up missing them - am i crazy?
sister [11f] sleeps beside m


### Add spaces to the punctuation we want to keep


In [174]:
# Remove angle brackets first so they cannot clash with the <END>/<START> markers
raw_relationship_data = raw_relationship_data.replace("<", " ")
raw_relationship_data = raw_relationship_data.replace(">", " ")

# Handle ellipses before padding full stops, otherwise "..." would never match
raw_relationship_data = raw_relationship_data.replace('…', " ")
raw_relationship_data = raw_relationship_data.replace('...', " , ")

# Mark post boundaries and pad the punctuation we keep with spaces
raw_relationship_data = raw_relationship_data.replace("\n", " <END> <START> ")
raw_relationship_data = raw_relationship_data.replace(".", " . ")
raw_relationship_data = raw_relationship_data.replace("?", " ? ")
raw_relationship_data = raw_relationship_data.replace(",", " , ")

# Map square brackets to parentheses so only one bracket type remains
raw_relationship_data = raw_relationship_data.replace("[", " (")
raw_relationship_data = raw_relationship_data.replace("]", ") ")

# Separators become spaces
raw_relationship_data = raw_relationship_data.replace(":", " ")
raw_relationship_data = raw_relationship_data.replace(";", " ")
raw_relationship_data = raw_relationship_data.replace("-", " ")
raw_relationship_data = raw_relationship_data.replace("!", " ")
raw_relationship_data = raw_relationship_data.replace("_", " ")
raw_relationship_data = raw_relationship_data.replace('/', " ")

# Quotes and apostrophes are dropped without leaving a space
raw_relationship_data = raw_relationship_data.replace('"', "")
raw_relationship_data = raw_relationship_data.replace("'", "")
raw_relationship_data = raw_relationship_data.replace("“", "")
raw_relationship_data = raw_relationship_data.replace('”', "")
raw_relationship_data = raw_relationship_data.replace('’', "")
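
Since the notes above suggest regular expressions would suit this step better, here is a rough sketch of what that could look like. The function name `clean_posts` is hypothetical, and this is an approximation of the replacements in the cell rather than a drop-in equivalent: it handles ellipses before full stops are padded and collapses runs of spaces at the end.

```python
import re

def clean_posts(text):
    """Regex-based sketch of the cleaning steps (not the notebook's exact code)."""
    text = text.lower()
    text = re.sub(r"[<>]", " ", text)                   # drop angle brackets before adding markers
    text = text.replace("…", " ")                       # single-character ellipsis becomes a space
    text = text.replace("...", " , ")                   # three dots, while they are still adjacent
    text = text.replace("\n", " <END> <START> ")        # post boundary markers
    text = re.sub(r"([.?,])", r" \1 ", text)            # pad the punctuation we keep
    text = text.replace("[", " (").replace("]", ") ")   # keep one bracket type
    text = re.sub(r"[:;\-!_/]", " ", text)              # separators become spaces
    text = re.sub("[\"'“”’]", "", text)                 # quotes and apostrophes are dropped
    return re.sub(r" {2,}", " ", text)                  # collapse runs of spaces

print(clean_posts("Am I Crazy?\nSister [11f] sleeps... ok"))
```

Grouping characters into classes keeps each rule on one line, which makes it easier to see which symbols are kept, replaced by a space, or deleted outright.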





I gave up on avoiding regular expressions. We can use one to check which non-alphanumeric characters are still in the text.

In [175]:
import re
set(re.sub(r'[A-Za-z0-9 ]', '', raw_relationship_data))

{'#',
 '$',
 '%',
 '&',
 '(',
 ')',
 '*',
 '+',
 ',',
 '.',
 '<',
 '=',
 '>',
 '?',
 '@',
 '\\',
 '^',
 '{',
 '|',
 '}',
 '~',
 '¿',
 'á',
 'ã',
 'ç',
 'é',
 'ê',
 'ô',
 'ü',
 'ă',
 'ı',
 'ť',
 'а',
 'в',
 'е',
 'ж',
 'и',
 'к',
 'л',
 'м',
 'о',
 'р',
 'с',
 'т',
 'х',
 'ч',
 'ы',
 'ь',
 'ấ',
 'ẻ',
 'ế',
 'ố',
 'ử',
 '\u200d',
 '–',
 '—',
 '‘',
 '„',
 '€',
 '☺',
 '♀',
 '♂',
 '♡',
 '♥',
 '️',
 '𝐆',
 '𝐋',
 '𝐑',
 '𝐒',
 '𝐓',
 '𝐚',
 '𝐞',
 '𝐟',
 '𝐠',
 '𝐡',
 '𝐢',
 '𝐤',
 '𝐥',
 '𝐦',
 '𝐧',
 '𝐨',
 '𝐩',
 '𝐬',
 '𝐭',
 '𝟗',
 '🎹',
 '🏻',
 '🏼',
 '🏽',
 '👏',
 '👧',
 '💔',
 '💕',
 '💝',
 '🔥',
 '😅',
 '😔',
 '😞',
 '😩',
 '😪',
 '😬',
 '😭',
 '😲',
 '🤔',
 '🤕',
 '🤦',
 '🤷',
 '🥵',
 '🥺'}

From this we can see there is a wide range of characters (other symbols, accented and Cyrillic letters, emoji) that is not covered by our replacing procedure. We will remove everything except:

* alphanumerics
* full stops, commas, question marks
* the `<` and `>` characters used in the `<END>` and `<START>` markers

In [176]:
relationship_data = re.sub(r'^[A-Za-z0-9 <>,.?]', ' ', raw_relationship_data)
relationship_data = relationship_data.replace("  ", " ")
print(relationship_data[:200])

 had a dream about a past flame and woke up missing them  am i crazy ? <END> <START> sister (11f) sleeps beside my (26m) used t shirts because it helps her sleep while im not at home . i find it uncom


In [177]:
len("  had a dream about a past flame and woke up missing them   am i crazy ?  <END>")

79

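The first post has gone wrong: its leading "i" has been replaced by a space. The cause is the regular expression, where the `^` sits outside the character class and therefore anchors the match to the start of the string instead of negating the class, so only the very first character was replaced. The rest of the posts do not have this problem, so we will just strip the first post from the front. We end up keeping some parentheses in as we want the (m23) type syntax; hopefully this will not impact performance significantly.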

In [178]:
relationship_data = relationship_data[77:]
print(relationship_data[:30])

<START> sister (11f) sleeps be
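
The pattern `^[A-Za-z0-9 <>,.?]` used earlier matches one listed character at the start of the string, so it only replaces the leading "i" and leaves emoji and other symbols in place. A minimal sketch of the negated class follows, on a made-up sample string. Keeping `(` and `)` in the class is an assumption, based on wanting to retain the (11f) style tags.

```python
import re

# Hypothetical sample in the same shape as raw_relationship_data after the replacements
sample = "i had a dream 💔 <END> <START> sister (11f) sleeps  beside my (26m) shirts . "

# Anchored version: `^` outside the brackets means "start of string",
# so only the first character is replaced.
anchored = re.sub(r"^[A-Za-z0-9 <>,.?]", " ", sample)

# Negated version: `^` inside the brackets means "anything NOT listed".
# Parentheses are kept in the class on the assumption that the (11f) tags are wanted.
filtered = re.sub(r"[^A-Za-z0-9 <>,.?()]", " ", sample)

# Collapse any run of spaces; a single replace("  ", " ") can leave doubles behind
filtered = re.sub(r" {2,}", " ", filtered)

print(repr(anchored[:20]))
print(filtered)
```

With the negated class the first post keeps its leading "i", so slicing off the first post would no longer be needed.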


It may be useful to replace the (GENDER_AGE) tags with a generic placeholder to avoid rare or out-of-vocabulary tokens; as it stands, the model will end up predicting some age based on the language.

I am not quite sure at which point to tokenise this data: definitely before creating the sequences, but it is unclear whether the data should be split into sentences first.

We will tokenise before creating sentences.
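
One way the placeholder idea could be sketched is with a single regular expression. The tag formats matched here (digits followed by `m`/`f`, or the reverse, inside parentheses) are an assumption about the data, and both the `<AGETAG>` placeholder and the function name are hypothetical.

```python
import re

# Matches tags such as (11f), (26m), (f23) or (m 23).
# The formats covered are an assumption; the real data may contain others.
AGE_GENDER_TAG = re.compile(r"\(\s*(?:\d{1,2}\s*[mf]|[mf]\s*\d{1,2})\s*\)")

def replace_age_gender(text, placeholder="<AGETAG>"):
    """Replace every age/gender tag with one generic placeholder token."""
    return AGE_GENDER_TAG.sub(placeholder, text)

print(replace_age_gender("sister (11f) sleeps beside my (26m) used t shirts"))
```

The pattern expects lowercased text, so it would run after the lowercasing step. Ordinary parenthesised words such as "(maybe)" are left untouched.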


### Tokenization

Separate the string into words using spaces to determine a new token. This will make punctuation tokens which is what we want for sentence structure.

Could use one of NLTK's casual tokenizers but 

In [179]:
# this has to be done after tokenisation: set() on a string collects unique characters, not words
vocab = sorted(set(relationship_data))
len_vocab = len(vocab)
print("Vocab length: ", len_vocab)


Vocab length:  153
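
The 153 above is a character count, because `set()` iterates over a string one character at a time. A minimal sketch of the whitespace tokenisation described in this section, followed by a word-level vocabulary, is shown below on a made-up sample; in the notebook the input would be `relationship_data`.

```python
# Hypothetical sample in the cleaned format
sample = "<START> sister (11f) sleeps beside my (26m) used t shirts . <END> <START> am i crazy ? <END>"

# split() with no argument splits on any run of whitespace,
# so punctuation padded with spaces becomes its own token
tokens = sample.split()

word_vocab = sorted(set(tokens))   # unique tokens: words, punctuation, markers
char_vocab = sorted(set(sample))   # unique characters, as computed in the cell above

print("Tokens:", len(tokens))
print("Word vocab length:", len(word_vocab))
print("Char vocab length:", len(char_vocab))
```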
