### Load the dataset

The text8 dataset used to train the word2vec model is taken from (_http://mattmahoney.net/dc/_); if the zip file is missing, it is downloaded again. The sample contains roughly 17M words, of which roughly 200k are unique, so the download can take a while. Training is then done in the run_Word2Vec module, which is loaded here as the interface to the model.

In [1]:
import Load_Text_Set as l_data
import run_Word2Vec as w2v

File is missing, retrieving  http://mattmahoney.net/dc/text8.zip
Found and verified text8.zip 
Words in file  17005207


## Train the model
Run the training and return the final normalized embeddings. The model can be trained using either skip-gram or continuous bag-of-words (CBOW), controlled by a switch in the file _tf_W0rd2Vec.py_. The other parameters of the model are also hard-coded in that file.
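
To illustrate the difference between the two modes, here is a minimal sketch of how the training examples are formed. It is not the code in _tf_W0rd2Vec.py_; the function names and the window size are made up for illustration. Skip-gram predicts each context word from the centre word, while CBOW predicts the centre word from its whole context.

```python
def skipgram_pairs(tokens, window=1):
    """(centre, context) pairs: one training example per context word."""
    pairs = []
    for i, centre in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((centre, tokens[j]))
    return pairs


def cbow_pairs(tokens, window=1):
    """(context list, centre) pairs: one training example per centre word."""
    pairs = []
    for i, centre in enumerate(tokens):
        context = [tokens[j]
                   for j in range(max(0, i - window), min(len(tokens), i + window + 1))
                   if j != i]
        pairs.append((context, centre))
    return pairs


tokens = ["the", "quick", "brown", "fox"]  # hypothetical toy sentence
print(skipgram_pairs(tokens))
print(cbow_pairs(tokens))
```

Skip-gram produces more (and simpler) examples per sentence, which is one reason the two modes can yield different embeddings on the same corpus.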

In [10]:
words = l_data.text_8(200000)
embeddings = w2v.run_embeddings()

Found and verified text8.zip 
Words in file  17005207
Initialized embeddings
(200000, 128)
Initialized
Average loss at step 0: 9.255837
Nearest to by: chickpea, device, filming, compound, homebase, stergiopoulos, jahrb, durante,
Nearest to people: alexandrov, cherno, slipper, phaenomenologica, getaway, kiskil, lushan, ajiva,
Nearest to four: karachaganak, kirsan, utility, ngen, mucel, arguers, stilpo, kirundo,
Nearest to can: tylos, berdan, gnd, trollish, aumeier, brilliants, specte, lorenzattractor,
Nearest to d: pk, enumerate, surr, valkyries, succoth, vakatakas, onesicritus, mirv,
Nearest to been: burgundes, mudhol, chene, schemer, fiurenzu, lavey, tarcher, pastries,
Nearest to so: hydrocactus, perata, mora, cursing, dkos, lookalikes, diaphragmic, langland,
Nearest to while: halite, butkus, zapp, chdir, mascott, inbox, toolchain, grunt,
Nearest to seven: sprengel, leavelle, pertain, improvisers, sportbund, friable, ionization, backlighting,
Nearest to were: skvirsky, cushioned, lant

## Some crude attempts at sentiment analysis
As part of the project, we would like to associate a mood with a memory by assigning a value between 0 and 1 to each of a series of moods:

* Five moods are used: _joy_, _sad_, _scary_ (fear), _disgust_ and _anger_.

* For each mood, a set of synonyms is used, and the score is the cosine similarity averaged over the synonyms and over the words of the sentence.
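
The scoring rule can be sketched on toy data as follows. This is only an illustration under stated assumptions: the three-dimensional vectors, the `toy_vectors` table and the `mood_score` helper are invented here and are not part of the trained model. Because the vectors are normalized to unit length, the dot product of two vectors is their cosine similarity; negative averages are set to 0.

```python
import numpy as np


def normalize(v):
    """Scale a vector to unit length."""
    v = np.asarray(v, dtype=float)
    return v / np.linalg.norm(v)


# Hypothetical toy embeddings, not taken from the trained model.
toy_vectors = {
    "happy": normalize([1.0, 0.1, 0.0]),
    "joy": normalize([0.9, 0.2, 0.0]),
    "sad": normalize([-1.0, 0.1, 0.0]),
    "smile": normalize([0.8, 0.3, 0.1]),
}


def mood_score(sentence_words, synonyms, vectors):
    """Average cosine similarity between sentence words and mood synonyms, clipped at 0."""
    known_syns = [s for s in synonyms if s in vectors]
    if not sentence_words or not known_syns:
        return 0.0
    total = 0.0
    for w in sentence_words:
        if w in vectors:  # unknown words contribute 0 but still count in the average
            total += np.mean([np.dot(vectors[w], vectors[s]) for s in known_syns])
    return max(total / len(sentence_words), 0.0)


print(mood_score(["smile"], ["happy", "joy"], toy_vectors))
print(mood_score(["smile"], ["sad"], toy_vectors))
```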


In [11]:
import numpy as np
import regex as re  # third-party 'regex' package; needed for the \p{P} punctuation class

joy_words = ['happy','joy','pleasure','glee']
sad_words = ['sad','unhappy','gloomy']
scary_words = ['scary','frightening','terrifying', 'horrifying']
disgust_words = ['disgust', 'distaste', 'repulsion']
anger_words = ['anger','rage','irritated']

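def syn_average(word, list_words=()):
    """Average similarity between word id `word` and the synonyms in `list_words`.

    The embeddings are normalized, so the dot product is the cosine similarity.
    """
    total = 0.0
    count = 0  # only count synonyms that are in the dictionary
    for syn in list_words:
        if syn in words.dictionary:
            syn_id = words.dictionary[syn]
            total += float(np.dot(embeddings[word], embeddings[syn_id]))
            count += 1
        else:
            print(syn, "is not in dict")
    # avoid dividing by zero when none of the synonyms is known
    return total / count if count else 0.0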

def test(string_words):
    """Print the mood scores of a tokenized sentence; negative scores are set to 0."""
    # same order as the printed header: happy, sad, scary, disgust, anger
    mood_lists = [joy_words, sad_words, scary_words, disgust_words, anger_words]
    totals = [0.0] * len(mood_lists)
    for a in string_words:
        if a in words.dictionary:
            in_dict = words.dictionary[a]
            for i, mood_words in enumerate(mood_lists):
                totals[i] += syn_average(in_dict, mood_words)

    # average over all words of the sentence; words not in the dictionary contribute 0
    scores = [max(t / len(string_words), 0) for t in totals]
    print(*scores, sep=" \t ")

def plot_emotions(top=8):
    """Print the `top` nearest neighbours of each emotion word."""
    emotions = ['joy', 'fear', 'sad', 'disgust', 'anger']

    for emotion in emotions:
        i_word = words.dictionary[emotion]
        # embeddings are normalized, so this is the cosine similarity to every word
        sim = np.matmul(embeddings, embeddings[i_word])
        # the first entry of the sorted list is the word itself, so skip it
        nearest = (-sim).argsort()[1:top + 1]
        print('Nearest to ', emotion, ": ")
        for k in range(top):
            # assumes reverse_dictionary is a dict mapping word id -> word
            close_word = words.reverse_dictionary[nearest[k]]
            print('\t', close_word)
        
        
    

## Proof of principle - _ish_

To test that the algorithms are working, we run them over a few passages that a human identified as happy, scary and angry, respectively.

Negative scores (i.e. the sentence's average embedding vector points in the opposite direction relative to the mood) are set to 0.



In [12]:
happy_string_ = "Even Harry, who knew nothing about the different brooms, thought it looked wonderful. Sleek and shiny, with a mahogany handle, it had a long tail of neat, straight twigs and Nimbus Two Thousand written in gold near the top. As seven o'clock drew nearer, Harry left the castle and set off in the dusk toward the Quidditch field. Held never been inside the stadium before. Hundreds of seats were raised in stands around the field so that the spectators were high enough to see what was going on. At either end of the field were three golden poles with hoops on the end. They reminded Harry of the little plastic sticks Muggle children blew bubbles through, except that they were fifty feet high. Too eager to fly again to wait for Wood, Harry mounted his broomstick and kicked off from the ground. What a feeling -- he swooped in and out of the goal posts and then sped up and down the field. The Nimbus Two Thousand turned wherever he wanted at his lightest touch."
scary_string = "and the next second, Harry felt Quirrell's hand close on his wrist. At once, a needle-sharp pain seared across Harry's scar; his head felt as though it was about to split in two; he yelled, struggling with all his might, and to his surprise, Quirrell let go of him. The pain in his head lessened -- he looked around wildly to see where Quirrell had gone, and saw him hunched in pain, looking at his fingers -- they were blistering before his eyes."
angry_string = 'He’d forgotten all about the people in cloaks until he passed a group of them next to the baker’s. He eyed them angrily as he passed. He didn’t know why, but they made him uneasy. This bunch were whispering  excitedly, too, and he couldn’t see a single collectingtin. It was on his way back past them, clutching a large doughnut in a bag, that he caught a few words of what they were saying.'

happy_string_words = re.sub(r"\p{P}+", "", happy_string_).split()
scary_string_words = re.sub(r"\p{P}+", "", scary_string).split()
angry_string_words = re.sub(r"\p{P}+", "",angry_string).split()
print("\n")
print("Sentence: ")
print(happy_string_)
print("Similarity to: ")
print("happy \t\t sad \t\t scary \t\t disgust \t\t anger")
test(happy_string_words)
print("\n")
print("Sentence: ")
print(scary_string)
print("Similarity to: ")
print("happy \t\t sad \t\t scary \t\t disgust \t\t anger")
test(scary_string_words)
print("\n")
print("Sentence: ")
print(angry_string)
print("Similarity to: ")
print("happy \t\t sad \t\t scary \t\t disgust \t\t anger")
test(angry_string_words)



Sentence: 
Even Harry, who knew nothing about the different brooms, thought it looked wonderful. Sleek and shiny, with a mahogany handle, it had a long tail of neat, straight twigs and Nimbus Two Thousand written in gold near the top. As seven o'clock drew nearer, Harry left the castle and set off in the dusk toward the Quidditch field. Held never been inside the stadium before. Hundreds of seats were raised in stands around the field so that the spectators were high enough to see what was going on. At either end of the field were three golden poles with hoops on the end. They reminded Harry of the little plastic sticks Muggle children blew bubbles through, except that they were fifty feet high. Too eager to fly again to wait for Wood, Harry mounted his broomstick and kicked off from the ground. What a feeling -- he swooped in and out of the goal posts and then sped up and down the field. The Nimbus Two Thousand turned wherever he wanted at his lightest touch.
Similarity to: 
happy 	

## Results
While the results are not exactly promising, three examples are in no way statistically significant. Ideally, I would like to establish a test dataset of sentences labeled by humans to get a more representative understanding of performance.
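
Such an evaluation could look roughly like the sketch below. Everything in it is hypothetical: the `labeled` examples, the helper names and the scores are placeholders, since no labeled dataset exists yet. The idea is to pick the mood with the highest score for each sentence and report the fraction that agrees with the human label.

```python
def predict_mood(scores):
    """Return the mood with the highest score from a {mood: score} dict."""
    return max(scores, key=scores.get)


def accuracy(examples):
    """Fraction of (scores, human_label) examples whose top mood matches the label."""
    if not examples:
        return 0.0
    hits = sum(1 for scores, label in examples if predict_mood(scores) == label)
    return hits / len(examples)


# Placeholder data: made-up scores and labels, for illustration only.
labeled = [
    ({"joy": 0.04, "sad": 0.01, "scary": 0.02}, "joy"),
    ({"joy": 0.03, "sad": 0.01, "scary": 0.02}, "scary"),
]
print(accuracy(labeled))
```
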
## Outlook
Once we establish a method to gauge the performance of an algorithm, we can try to improve it (_if necessary_).

Potential improvements:
  * Skip-gram versus CBOW
  * Larger training datasets?
  * Which words carry the most weight in a sentence? (Only use some combination of verbs, nouns, adjectives, ...)
  * This was fun, but there are pretrained models out there that are most likely more efficient and better trained. Use them.