## Practical 2: Text Classification with Word Embedding
<p>Oxford CS - Deep NLP 2017<br>
https://www.cs.ox.ac.uk/teaching/courses/2016-2017/dl/</p>
<p>[Yannis Assael, Brendan Shillingford, Chris Dyer]</p>

In [1]:
import numpy as np
import os
from random import shuffle
import re

In [2]:
from bokeh.models import ColumnDataSource, LabelSet
from bokeh.plotting import figure, show, output_file
from bokeh.io import output_notebook
output_notebook()

### Load TED dataset 

In [3]:
import urllib.request
import zipfile
import lxml.etree

In [4]:
# Download the dataset if it's not already there: this may take a minute as it is 75MB
if not os.path.isfile('ted_en-20160408.zip'):
    urllib.request.urlretrieve("https://wit3.fbk.eu/get.php?path=XML_releases/xml/ted_en-20160408.zip&filename=ted_en-20160408.zip", filename="ted_en-20160408.zip")

In [5]:
with zipfile.ZipFile('ted_en-20160408.zip', 'r') as z:
    doc = lxml.etree.parse(z.open('ted_en-20160408.xml', 'r'))
    #print(lxml.etree.tostring(doc).decode('ascii')[10000:30000])

input_text = '\n'.join(doc.xpath('//content/text()'))
del doc

In [6]:
# For now, we're only interested in the subtitle text, so let's extract that from the XML:
with zipfile.ZipFile('ted_en-20160408.zip', 'r') as z:
    doc = lxml.etree.parse(z.open('ted_en-20160408.xml', 'r'))
    #print(lxml.etree.tostring(doc).decode('ascii')[10000:30000])

# input_text = '\n'.join(doc.xpath('//content/text()'))

# print(type(doc.xpath('//content/text()')))  ->  list 
doc_list = doc.xpath('//content/text()')
label_list = doc.xpath('//keywords/text()')

del doc

In [7]:
print(doc_list[:2])

['Here are two reasons companies fail: they only do more of the same, or they only do what\'s new.\nTo me the real, real solution to quality growth is figuring out the balance between two activities: exploration and exploitation. Both are necessary, but it can be too much of a good thing.\nConsider Facit. I\'m actually old enough to remember them. Facit was a fantastic company. They were born deep in the Swedish forest, and they made the best mechanical calculators in the world. Everybody used them. And what did Facit do when the electronic calculator came along? They continued doing exactly the same. In six months, they went from maximum revenue ... and they were gone. Gone.\nTo me, the irony about the Facit story is hearing about the Facit engineers, who had bought cheap, small electronic calculators in Japan that they used to double-check their calculators.\n(Laughter)\nFacit did too much exploitation. But exploration can go wild, too.\nA few years back, I worked closely alongside a

In [8]:
print(len(doc_list))

2085


In [9]:
print(label_list[:2])

['talks, business, creativity, curiosity, goal-setting, innovation, motivation, potential, success, work', 'talks, Planets, TEDx, bacteria, biology, engineering, environment, evolution, exploration, future, innovation, intelligence, microbiology, nature, potential, science']


In [10]:
print(len(label_list))

2085


In [11]:
labels = ['ooo', 'Too', 'oEo', 'ooD', 'TEo', 'ToD', 'oED', 'TED']
label_dict = {label: i for i, label in enumerate(labels)}
print(label_dict)

{'ooo': 0, 'Too': 1, 'oEo': 2, 'ooD': 3, 'TEo': 4, 'ToD': 5, 'oED': 6, 'TED': 7}


In [12]:
def get_label(keywords):
    label_string = keywords.lower()
    if ("technology" in label_string) and ("entertainment" in label_string) and ("design" in label_string):
        return label_dict['TED']
    elif ("entertainment" in label_string) and ("design" in label_string):
        return label_dict['oED']
    elif ("technology" in label_string) and ("design" in label_string):
        return label_dict['ToD']
    elif ("technology" in label_string) and ("entertainment" in label_string):
        return label_dict['TEo']
    elif ("design" in label_string):
        return label_dict['ooD']
    elif ("entertainment" in label_string):
        return label_dict['oEo']
    elif ("technology" in label_string):
        return label_dict['Too']
    else:
        return label_dict['ooo']
   
# for keywords in label_list[:10]:
#     print(keywords)
label_list_temp = [get_label(keywords) for keywords in label_list]        
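
The eight labels record which of the three keywords (technology, entertainment, design) appear in a talk's keyword string. As a minimal sketch of the same mapping, the three-letter code can be assembled from one flag per topic and then looked up. `label_from_keywords` and the upper-case names are hypothetical and not part of the notebook; the substring matching mirrors `get_label`.

```python
LABELS = ['ooo', 'Too', 'oEo', 'ooD', 'TEo', 'ToD', 'oED', 'TED']
LABEL_TO_ID = {name: i for i, name in enumerate(LABELS)}

def label_from_keywords(keywords):
    """Hypothetical helper: build the three-letter code, then look up its id."""
    text = keywords.lower()
    code = ''.join(
        letter if topic in text else 'o'
        for letter, topic in (('T', 'technology'),
                              ('E', 'entertainment'),
                              ('D', 'design'))
    )
    return LABEL_TO_ID[code]

print(label_from_keywords('talks, business, creativity'))  # 0 ('ooo')
print(label_from_keywords('talks, Design, technology'))    # 5 ('ToD')
```

Because each position of the code is decided independently, the chain of `elif` branches is not needed and the order of the checks no longer matters.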

In [13]:
print(len(label_list_temp))
print(label_list_temp[:10])

2085
[0, 0, 0, 3, 5, 0, 0, 0, 0, 5]


In [14]:
labelled_doc = list(zip(doc_list, label_list_temp))

In [15]:
print(labelled_doc[1])

('So there are lands few and far between on Earth itself that are hospitable to humans by any measure, but survive we have. Our primitive ancestors, when they found their homes and livelihood endangered, they dared to make their way into unfamiliar territories in search of better opportunities. And as the descendants of these explorers, we have their nomadic blood coursing through our own veins. But at the same time, distracted by our bread and circuses and embroiled in the wars that we have waged on each other, it seems that we have forgotten this desire to explore. We, as a species, we\'re evolved uniquely for Earth, on Earth, and by Earth, and so content are we with our living conditions that we have grown complacent and just too busy to notice that its resources are finite, and that our Sun\'s life is also finite. While Mars and all the movies made in its name have reinvigorated the ethos for space travel, few of us seem to truly realize that our species\' fragile constitution is w

In [16]:
print(len(labelled_doc))

2085


In [17]:
train_doc_temp = labelled_doc[:1585]
valid_doc_temp = labelled_doc[-500:-250]
test_doc_temp = labelled_doc[-250:]

print(len(train_doc_temp))
print(len(valid_doc_temp))
print(len(test_doc_temp))

1585
250
250


In [18]:
print(train_doc_temp[0])

('Here are two reasons companies fail: they only do more of the same, or they only do what\'s new.\nTo me the real, real solution to quality growth is figuring out the balance between two activities: exploration and exploitation. Both are necessary, but it can be too much of a good thing.\nConsider Facit. I\'m actually old enough to remember them. Facit was a fantastic company. They were born deep in the Swedish forest, and they made the best mechanical calculators in the world. Everybody used them. And what did Facit do when the electronic calculator came along? They continued doing exactly the same. In six months, they went from maximum revenue ... and they were gone. Gone.\nTo me, the irony about the Facit story is hearing about the Facit engineers, who had bought cheap, small electronic calculators in Japan that they used to double-check their calculators.\n(Laughter)\nFacit did too much exploitation. But exploration can go wild, too.\nA few years back, I worked closely alongside a

In [19]:
def tokenize_and_lowercase(text):
    text_noparens = re.sub(r'\([^)]*\)', '', text)
    sentences_strings = []
    for line in text_noparens.split('\n'):
        m = re.match(r'^(?:(?P<precolon>[^:]{,20}):)?(?P<postcolon>.*)$', line)
        sentences_strings.extend(sent for sent in m.groupdict()['postcolon'].split('.') if sent)
        
    sentences= []
    for sent_str in sentences_strings:
        tokens = re.sub(r"[^a-z0-9]+", " ", sent_str.lower()).split()
        sentences.append(tokens)
    return sentences
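
To see how the tokenizer treats speaker prefixes, parenthesised annotations and punctuation, the sketch below runs it on a short made-up snippet. The function body is copied from the cell above; the `sample` string is invented for illustration.

```python
import re

def tokenize_and_lowercase(text):
    # Drop parenthesised annotations such as (Laughter)
    text_noparens = re.sub(r'\([^)]*\)', '', text)
    sentences_strings = []
    for line in text_noparens.split('\n'):
        # Strip a short "Speaker:" prefix (at most 20 characters before the colon)
        m = re.match(r'^(?:(?P<precolon>[^:]{,20}):)?(?P<postcolon>.*)$', line)
        sentences_strings.extend(sent for sent in m.groupdict()['postcolon'].split('.') if sent)

    sentences = []
    for sent_str in sentences_strings:
        # Lowercase, then turn every run of non-alphanumeric characters into a space
        tokens = re.sub(r"[^a-z0-9]+", " ", sent_str.lower()).split()
        sentences.append(tokens)
    return sentences

sample = "Speaker: Hello there. (Laughter)\nIt's 2017."
print(tokenize_and_lowercase(sample))
# [['hello', 'there'], [], ['it', 's', '2017']]
```

The empty list comes from the whitespace left behind after `(Laughter)` is removed. This version keeps such empty sentences, which is why the later `tokenize_and_lowercase_most_common` filters them out.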

In [20]:
train_doc = [(tokenize_and_lowercase(doc_temp[0]), doc_temp[1]) for doc_temp in train_doc_temp]

In [21]:
print(len(train_doc))

1585


In [22]:
print(train_doc[0], '\n')
print(train_doc[0][0], '\n')
print(train_doc[0][0][0])

([['here', 'are', 'two', 'reasons', 'companies', 'fail', 'they', 'only', 'do', 'more', 'of', 'the', 'same', 'or', 'they', 'only', 'do', 'what', 's', 'new'], ['to', 'me', 'the', 'real', 'real', 'solution', 'to', 'quality', 'growth', 'is', 'figuring', 'out', 'the', 'balance', 'between', 'two', 'activities', 'exploration', 'and', 'exploitation'], ['both', 'are', 'necessary', 'but', 'it', 'can', 'be', 'too', 'much', 'of', 'a', 'good', 'thing'], ['consider', 'facit'], ['i', 'm', 'actually', 'old', 'enough', 'to', 'remember', 'them'], ['facit', 'was', 'a', 'fantastic', 'company'], ['they', 'were', 'born', 'deep', 'in', 'the', 'swedish', 'forest', 'and', 'they', 'made', 'the', 'best', 'mechanical', 'calculators', 'in', 'the', 'world'], ['everybody', 'used', 'them'], ['and', 'what', 'did', 'facit', 'do', 'when', 'the', 'electronic', 'calculator', 'came', 'along', 'they', 'continued', 'doing', 'exactly', 'the', 'same'], ['in', 'six', 'months', 'they', 'went', 'from', 'maximum', 'revenue'], ['an

In [23]:
input_text = '\n'.join([train_doc_temp[i][0] for i in range(len(train_doc_temp))])

In [24]:
print(input_text[:10000])

Here are two reasons companies fail: they only do more of the same, or they only do what's new.
To me the real, real solution to quality growth is figuring out the balance between two activities: exploration and exploitation. Both are necessary, but it can be too much of a good thing.
Consider Facit. I'm actually old enough to remember them. Facit was a fantastic company. They were born deep in the Swedish forest, and they made the best mechanical calculators in the world. Everybody used them. And what did Facit do when the electronic calculator came along? They continued doing exactly the same. In six months, they went from maximum revenue ... and they were gone. Gone.
To me, the irony about the Facit story is hearing about the Facit engineers, who had bought cheap, small electronic calculators in Japan that they used to double-check their calculators.
(Laughter)
Facit did too much exploitation. But exploration can go wild, too.
A few years back, I worked closely alongside a European 

In [25]:
sentences_ted = tokenize_and_lowercase(input_text)

In [26]:
print(len(sentences_ted))
print(sentences_ted[0])
print(sentences_ted[1])

188306
['here', 'are', 'two', 'reasons', 'companies', 'fail', 'they', 'only', 'do', 'more', 'of', 'the', 'same', 'or', 'they', 'only', 'do', 'what', 's', 'new']
['to', 'me', 'the', 'real', 'real', 'solution', 'to', 'quality', 'growth', 'is', 'figuring', 'out', 'the', 'balance', 'between', 'two', 'activities', 'exploration', 'and', 'exploitation']


## Build vocabulary using training set 

In [27]:
from collections import Counter 

In [28]:
# Count distinct word types; a set avoids the quadratic list-membership scan
k = {word for sent in sentences_ted for word in sent}
print(len(k))

47115


In [29]:
del k 

In [30]:
counts_ted_top100000 = []
c = Counter([word for sent in sentences_ted for word in sent])

In [31]:
list_most_common = c.most_common(47000)
print(list_most_common[:10])

[('the', 148314), ('and', 106841), ('to', 90699), ('of', 81748), ('a', 75296), ('that', 67989), ('i', 58165), ('in', 56239), ('it', 51467), ('we', 49525)]


In [32]:
words_most_common = [ item[0] for item in list_most_common]
print(words_most_common[:10])

['the', 'and', 'to', 'of', 'a', 'that', 'i', 'in', 'it', 'we']


In [33]:
for word, count in list_most_common:
    counts_ted_top100000.append(count)
    
print(counts_ted_top100000[:10])
print(counts_ted_top100000[-10:])

[148314, 106841, 90699, 81748, 75296, 67989, 58165, 56239, 51467, 49525]
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]


In [34]:
unknown_token = "UnknownTK"

# Set membership is O(1); scanning the words_most_common list for every token is far slower
vocab_set = set(words_most_common)

def replace_unknown_token(sent_list):
    """Replace every token outside the vocabulary with unknown_token."""
    return [word if word in vocab_set else unknown_token for word in sent_list]

def tokenize_and_lowercase_most_common(text):
    text_noparens = re.sub(r'\([^)]*\)', '', text)
    sentences_strings = []
    for line in text_noparens.split('\n'):
        m = re.match(r'^(?:(?P<precolon>[^:]{,20}):)?(?P<postcolon>.*)$', line)
        sentences_strings.extend(sent for sent in m.groupdict()['postcolon'].split('.') if sent)
        
    sentences= []
    for sent_str in sentences_strings:
        tokens = replace_unknown_token(re.sub(r"[^a-z0-9]+", " ", sent_str.lower()).split())
        if tokens != []:
            sentences.append(tokens)
    return sentences
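
The vocabulary step keeps the most frequent words and maps everything else to one placeholder token. A toy sketch of that idea follows; the sentences and the cut-off of 3 words are invented for illustration, whereas the notebook keeps 47000 words.

```python
from collections import Counter

UNK = "UnknownTK"  # same placeholder string as in the notebook

toy_sentences = [
    ['the', 'cat', 'sat'],
    ['the', 'dog', 'sat'],
    ['the', 'cat', 'ran'],
]

counts = Counter(word for sent in toy_sentences for word in sent)
# Keep the 3 most frequent words (arbitrary cut-off for this toy data)
vocab = {word for word, _ in counts.most_common(3)}

replaced = [[w if w in vocab else UNK for w in sent] for sent in toy_sentences]
print(replaced)
# [['the', 'cat', 'sat'], ['the', 'UnknownTK', 'sat'], ['the', 'cat', 'UnknownTK']]
```

Only words that fall below the cut-off are replaced, so with a large cut-off the placeholder stands in for a small number of rare words.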

In [35]:
# Training set 
train_doc = [(tokenize_and_lowercase_most_common(doc_temp[0]), doc_temp[1]) for doc_temp in train_doc_temp]

In [36]:
print(len(train_doc))

1585


In [37]:
print(train_doc[0], '\n')

([['here', 'are', 'two', 'reasons', 'companies', 'fail', 'they', 'only', 'do', 'more', 'of', 'the', 'same', 'or', 'they', 'only', 'do', 'what', 's', 'new'], ['to', 'me', 'the', 'real', 'real', 'solution', 'to', 'quality', 'growth', 'is', 'figuring', 'out', 'the', 'balance', 'between', 'two', 'activities', 'exploration', 'and', 'exploitation'], ['both', 'are', 'necessary', 'but', 'it', 'can', 'be', 'too', 'much', 'of', 'a', 'good', 'thing'], ['consider', 'facit'], ['i', 'm', 'actually', 'old', 'enough', 'to', 'remember', 'them'], ['facit', 'was', 'a', 'fantastic', 'company'], ['they', 'were', 'born', 'deep', 'in', 'the', 'swedish', 'forest', 'and', 'they', 'made', 'the', 'best', 'mechanical', 'calculators', 'in', 'the', 'world'], ['everybody', 'used', 'them'], ['and', 'what', 'did', 'facit', 'do', 'when', 'the', 'electronic', 'calculator', 'came', 'along', 'they', 'continued', 'doing', 'exactly', 'the', 'same'], ['in', 'six', 'months', 'they', 'went', 'from', 'maximum', 'revenue'], ['an

In [38]:
print(train_doc[0][0], '\n')

[['here', 'are', 'two', 'reasons', 'companies', 'fail', 'they', 'only', 'do', 'more', 'of', 'the', 'same', 'or', 'they', 'only', 'do', 'what', 's', 'new'], ['to', 'me', 'the', 'real', 'real', 'solution', 'to', 'quality', 'growth', 'is', 'figuring', 'out', 'the', 'balance', 'between', 'two', 'activities', 'exploration', 'and', 'exploitation'], ['both', 'are', 'necessary', 'but', 'it', 'can', 'be', 'too', 'much', 'of', 'a', 'good', 'thing'], ['consider', 'facit'], ['i', 'm', 'actually', 'old', 'enough', 'to', 'remember', 'them'], ['facit', 'was', 'a', 'fantastic', 'company'], ['they', 'were', 'born', 'deep', 'in', 'the', 'swedish', 'forest', 'and', 'they', 'made', 'the', 'best', 'mechanical', 'calculators', 'in', 'the', 'world'], ['everybody', 'used', 'them'], ['and', 'what', 'did', 'facit', 'do', 'when', 'the', 'electronic', 'calculator', 'came', 'along', 'they', 'continued', 'doing', 'exactly', 'the', 'same'], ['in', 'six', 'months', 'they', 'went', 'from', 'maximum', 'revenue'], ['and

In [114]:
for i in range(10):
    print(train_doc[0][0][i])

['here', 'are', 'two', 'reasons', 'companies', 'fail', 'they', 'only', 'do', 'more', 'of', 'the', 'same', 'or', 'they', 'only', 'do', 'what', 's', 'new']
['to', 'me', 'the', 'real', 'real', 'solution', 'to', 'quality', 'growth', 'is', 'figuring', 'out', 'the', 'balance', 'between', 'two', 'activities', 'exploration', 'and', 'exploitation']
['both', 'are', 'necessary', 'but', 'it', 'can', 'be', 'too', 'much', 'of', 'a', 'good', 'thing']
['consider', 'facit']
['i', 'm', 'actually', 'old', 'enough', 'to', 'remember', 'them']
['facit', 'was', 'a', 'fantastic', 'company']
['they', 'were', 'born', 'deep', 'in', 'the', 'swedish', 'forest', 'and', 'they', 'made', 'the', 'best', 'mechanical', 'calculators', 'in', 'the', 'world']
['everybody', 'used', 'them']
['and', 'what', 'did', 'facit', 'do', 'when', 'the', 'electronic', 'calculator', 'came', 'along', 'they', 'continued', 'doing', 'exactly', 'the', 'same']
['in', 'six', 'months', 'they', 'went', 'from', 'maximum', 'revenue']


### Rebuild the vocabulary 

In [39]:
counts_ted_top100000_new = []
            
c_new = Counter([word for doc in train_doc for sent in doc[0] for word in sent])

In [40]:
list_most_common_new = c_new.most_common(47001)
print(list_most_common_new[:10])

[('the', 148314), ('and', 106841), ('to', 90699), ('of', 81748), ('a', 75296), ('that', 67989), ('i', 58165), ('in', 56239), ('it', 51467), ('we', 49525)]


In [41]:
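print(c_new['UnknownTK'])
# 'UnknownTK' is already counted in c_new, so append it only if most_common() left it out
if 'UnknownTK' not in dict(list_most_common_new):
    list_most_common_new.append(('UnknownTK', 1))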


115


In [42]:
words_most_common_new = [ item[0] for item in list_most_common_new]
print(words_most_common_new[:10])

['the', 'and', 'to', 'of', 'a', 'that', 'i', 'in', 'it', 'we']


In [43]:
for word, count in list_most_common_new:
    counts_ted_top100000_new.append(count)
    
print(counts_ted_top100000_new[:10])
print(counts_ted_top100000_new[-10:])

[148314, 106841, 90699, 81748, 75296, 67989, 58165, 56239, 51467, 49525]
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]


In [44]:
sentences = [sent for doc in train_doc for sent in doc[0]]

In [45]:
sentences[1]

['to',
 'me',
 'the',
 'real',
 'real',
 'solution',
 'to',
 'quality',
 'growth',
 'is',
 'figuring',
 'out',
 'the',
 'balance',
 'between',
 'two',
 'activities',
 'exploration',
 'and',
 'exploitation']

## Model 

### Train Word2Vec 

In [46]:
from gensim.models import Word2Vec

In [47]:
model_trained = Word2Vec(sentences, min_count=1, size=100)

In [48]:
model_trained.save("word2vec_model_vocab47001_mincount1")

In [49]:
model = Word2Vec.load("word2vec_model_vocab47001_mincount1")

In [50]:
print(len(model.wv.vocab))

47001


In [51]:
model.wv.most_similar('UnknownTK')

[('nfl', 0.827136754989624),
 ('italian', 0.825472891330719),
 ('decimated', 0.825014054775238),
 ('congenital', 0.8245549201965332),
 ('l', 0.8221679925918579),
 ('roald', 0.8216572999954224),
 ('oculus', 0.8187201023101807),
 ('thrust', 0.8181350231170654),
 ('ox', 0.8158272504806519),
 ('technocrats', 0.815642774105072)]
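
The scores returned by `most_similar` are cosine similarities between word vectors. A minimal sketch of that measure, using made-up 3-dimensional vectors rather than the trained embeddings (`cosine_similarity` is a hypothetical helper, not a gensim function):

```python
import math

def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

a = [1.0, 2.0, 0.0]
b = [2.0, 4.0, 0.0]   # same direction as a
c = [0.0, 0.0, 3.0]   # orthogonal to a

print(cosine_similarity(a, b))  # approximately 1.0
print(cosine_similarity(a, c))  # 0.0
```

The measure depends only on direction, not on vector length, so two words can score highly even when their vectors differ a lot in magnitude.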

In [52]:
print(model.wv['it'].shape)
print(model.wv['it'])

(100,)
[ 1.20787346  0.27783382  2.96324563  0.5772866  -0.61856365  0.17629476
  0.14164381 -0.35850075  0.950804   -2.22759342  0.62552279  0.93691742
  0.63060606 -0.5529179  -0.61665118  0.52841729 -0.23892108 -0.30634585
 -0.88684738  1.38360262  4.07542324  1.81895471  0.43627098 -0.40864843
  0.5563013  -0.15118779 -0.08758353 -0.22335196 -1.57582915  0.56395435
 -0.44580376  0.45437661 -2.15703321  3.26593924 -0.37317896 -1.11813104
 -0.45251217 -1.34095895  0.66345918  0.07126807 -0.22967933  0.96616226
 -0.32106581 -0.10020237  0.09889627 -1.24980712  0.14713278 -1.18839693
  0.76698643  1.31079948  1.03377664  0.73085886 -0.0812294   0.50940573
  0.5368554  -1.00265169  0.21758187 -1.05759156 -0.20690595  0.21874021
  1.55585802  0.07748078 -0.19541292 -1.14609861 -0.57014054 -2.20842385
  1.58665299  0.76519334 -1.34158385  1.17245936 -0.23474972  2.02114201
  1.10555387 -0.12429544  0.46398905  1.92463505 -0.41800261  1.03551757
  1.41510427  0.93166739  1.12278306  0.7923

## Checking the model with t-SNE

In [53]:
hist, edges = np.histogram(counts_ted_top100000_new, density=True, bins=100)

p = figure(tools="pan,wheel_zoom,reset,save",
           toolbar_location="above",
           title="Word count distribution")
p.quad(top=hist, bottom=0, left=edges[:-1], right=edges[1:], line_color="#555555")
show(p)

In [None]:
words_top_ted_new = []
for word, count in list_most_common_new:
    words_top_ted_new.append(word)
    
print(words_top_ted_new[:10])

In [None]:
from sklearn.manifold import TSNE

In [None]:
# words_top_ted_new is a list of strings: the vocabulary words, most frequent first
words_top_vec_ted_new = model.wv[words_top_ted_new]

tsne = TSNE(n_components=2, random_state=0)
words_top_ted_tsne_new = tsne.fit_transform(words_top_vec_ted_new)

In [None]:
p = figure(tools="pan,wheel_zoom,reset,save",
           toolbar_location="above",
           title="word2vec T-SNE for most common words")

source = ColumnDataSource(data=dict(x1=words_top_ted_tsne_new[:,0],
                                    x2=words_top_ted_tsne_new[:,1],
                                    names=words_top_ted_new))

p.scatter(x="x1", y="x2", size=8, source=source)

# Named word_labels so the earlier `labels` list of class names is not overwritten
word_labels = LabelSet(x="x1", y="x2", text="names", y_offset=6,
                       text_font_size="8pt", text_color="#555555",
                       source=source, text_align='center')
p.add_layout(word_labels)

show(p)

## Dataset

In [53]:
# Training set 
train_doc = [(tokenize_and_lowercase_most_common(doc_temp[0]), doc_temp[1]) for doc_temp in train_doc_temp]
train_doc = [item for item in train_doc if item[0] != []]

In [54]:
# Validation set 
valid_doc = [(tokenize_and_lowercase_most_common(doc_temp[0]), doc_temp[1]) for doc_temp in valid_doc_temp]
valid_doc = [item for item in valid_doc if item[0] != []]

In [55]:
# Test set 
test_doc = [(tokenize_and_lowercase_most_common(doc_temp[0]), doc_temp[1]) for doc_temp in test_doc_temp]
test_doc = [item for item in test_doc if item[0] != []]

In [56]:
print(valid_doc[0])

([['i', 'm', 'extremely', 'excited', 'to', 'be', 'given', 'the', 'opportunity', 'to', 'come', 'and', 'speak', 'to', 'you', 'today', 'about', 'what', 'i', 'consider', 'to', 'be', 'the', 'biggest', 'stunt', 'on', 'earth'], ['or', 'perhaps', 'not', 'quite', 'on', 'earth'], ['a', 'parachute', 'jump', 'from', 'the', 'very', 'edge', 'of', 'space'], ['more', 'about', 'that', 'a', 'bit', 'later', 'on'], ['what', 'i', 'd', 'like', 'to', 'do', 'first', 'is', 'take', 'you', 'through', 'a', 'very', 'brief', 'helicopter', 'ride', 'of', 'stunts', 'and', 'the', 'stunts', 'industry', 'in', 'the', 'movies', 'and', 'in', 'television', 'and', 'show', 'you', 'how', 'technology', 'has', 'started', 'to', 'interface', 'with', 'the', 'physical', 'skills', 'of', 'the', 'stunt', 'performer', 'in', 'a', 'way', 'that', 'makes', 'the', 'stunts', 'bigger', 'and', 'actually', 'makes', 'them', 'safer', 'than', 'they', 've', 'ever', 'been', 'before'], ['i', 've', 'been', 'a', 'professional', 'stunt', 'man', 'for', '13

In [57]:
print(test_doc[0])

([['i', 'had', 'a', 'fire', 'nine', 'days', 'ago'], ['my', 'archive', '175', 'films', 'my', '16', 'millimeter', 'negative', 'all', 'my', 'books', 'my', 'dad', 's', 'books', 'my', 'photographs'], ['i', 'd', 'collected', 'i', 'was', 'a', 'collector', 'major', 'big', 'time'], ['it', 's', 'gone'], ['i', 'just', 'looked', 'at', 'it', 'and', 'i', 'didn', 't', 'know', 'what', 'to', 'do'], ['i', 'mean', 'this', 'was', 'was', 'i', 'my', 'things', 'i', 'always', 'live', 'in', 'the', 'present', 'i', 'love', 'the', 'present'], ['i', 'cherish', 'the', 'future'], ['and', 'i', 'was', 'taught', 'some', 'strange', 'thing', 'as', 'a', 'kid', 'like', 'you', 've', 'got', 'to', 'make', 'something', 'good', 'out', 'of', 'something', 'bad'], ['you', 've', 'got', 'to', 'make', 'something', 'good', 'out', 'of', 'something', 'bad'], ['this', 'was', 'bad', 'man', 'i', 'was', 'i', 'cough'], ['i', 'was', 'sick'], ['that', 's', 'my', 'camera', 'lens'], ['the', 'first', 'one', 'the', 'one', 'i', 'shot', 'my', 'bob',

## Bag of Means 

In [58]:
def embed_text(model, text):
    """Embed a document as the mean of its word vectors (bag of means).

    Arguments:
        model: trained Word2Vec model.
        text: tokenized document, a list of sentences (each a list of words).

    Returns:
        1-D numpy array holding the averaged embedding.
    """
    vector_list = [model.wv[word] for sent in text for word in sent]
    return sum(vector_list) / len(vector_list)
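
A bag-of-means representation ignores word order and sentence boundaries: every token's vector is averaged into a single document vector. The sketch below illustrates this with hypothetical 2-dimensional embeddings in a plain dictionary; the trained model uses 100 dimensions, and `bag_of_means` is an illustrative name.

```python
import numpy as np

# Hypothetical toy embeddings, not taken from the trained model
toy_vectors = {
    'good': np.array([1.0, 0.0]),
    'talk': np.array([0.0, 1.0]),
    'very': np.array([1.0, 1.0]),
}

def bag_of_means(vectors, doc):
    """Average the vectors of all tokens in a document (a list of token lists)."""
    stacked = np.array([vectors[word] for sent in doc for word in sent])
    return stacked.mean(axis=0)

doc = [['very', 'good'], ['talk']]
doc_vector = bag_of_means(toy_vectors, doc)
print(doc_vector)  # both components are 2/3
```

The three token vectors sum to (2, 2), and dividing by the token count of 3 gives the document vector.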
    

In [59]:
test_bom = embed_text(model, train_doc[0][0])
print(test_bom.shape)
print(test_bom)

(100,)
[ 0.61317599  0.13318765  0.877092    0.30233732  0.50376517  0.04399596
 -0.08663318 -0.29708579  0.69379228  0.10649608 -0.01964743  0.54565632
  0.36889261 -0.32808724  0.20847213 -0.24323185  0.07905903 -0.09735632
 -0.1074714  -0.13894127  0.41548389  0.45580539  0.04107737  0.11766125
 -0.16154069  0.333588   -0.02465178  0.18323582  0.3343983   0.4087621
 -0.1403546   0.34203297 -0.16795711  0.20591733  0.14874336 -0.00501728
 -0.34611276 -0.30757073  0.300657    0.66527104  0.43257919  0.35578319
  0.13519806  0.0440546   0.11838242 -0.35892734  0.13887228 -0.3578656
  0.26338345  0.67413014  0.27571344  0.07908486 -0.19135548  0.11566199
 -0.47183725 -0.15311864 -0.57623869 -0.2464835  -0.13866121  0.28344458
  0.38064057  0.271348   -0.4831897   0.31397161 -0.31131256 -0.74351436
  0.18034452  0.20382388 -0.21868213  0.49019992 -0.27636632  0.2451379
  0.11290224  0.14161813  0.04779363  0.03962253 -0.26890409  0.31561318
  0.06359278 -0.53328043  0.68259364 -0.0711818

In [60]:
def embed_corpus(model, corpus):
    return np.asarray([embed_text(model, doc[0]) for doc in corpus])

In [61]:
train_doc_embedded = embed_corpus(model, train_doc)

In [62]:
print(train_doc_embedded.shape)
print(train_doc_embedded[1])

(1579, 100)
[ 0.4802019   0.07306197  0.76076442  0.265679    0.65074688 -0.13978228
 -0.05084927 -0.55925024  0.7502864   0.10501765  0.0969109   0.55776674
  0.47652876 -0.11450881  0.07318046 -0.35512263 -0.04953027 -0.09045479
 -0.02341042 -0.23051201  0.43806779  0.17775384  0.2505416   0.17436594
  0.04472227  0.20244192 -0.07194624  0.1739772   0.22243074  0.13083898
 -0.23973735  0.36757773 -0.03951558  0.18183237 -0.09743074  0.00374067
 -0.20940034 -0.46842694  0.24585925  0.71721613  0.3955411   0.33265039
  0.16678037  0.03051815 -0.09335352 -0.45025772  0.18648751 -0.37605691
  0.23591815  0.50330418  0.18871799  0.13318273 -0.03405144 -0.0985103
 -0.25906378 -0.11956112 -0.49184349 -0.24498062 -0.12874043  0.18451999
  0.30606422  0.39163041 -0.48322365  0.17247336 -0.29539114 -0.55813003
 -0.0102161   0.22323622 -0.38711059  0.44130534 -0.45928118  0.17236429
  0.14857092  0.0618463   0.15777689  0.04661241 -0.23681085  0.37428489
 -0.08256655 -0.43527746  0.78743666 -0.

In [63]:
def encode_label(label, size):
    one_hot = [0] * size
    one_hot[label] = 1
    return one_hot

def encode_class(corpus, size):
    return np.asarray([encode_label(doc[1], size) for doc in corpus])

In [64]:
print(encode_label(2, 8))
print(encode_class(train_doc[:5], 8))

[0, 0, 1, 0, 0, 0, 0, 0]
[[1 0 0 0 0 0 0 0]
 [1 0 0 0 0 0 0 0]
 [1 0 0 0 0 0 0 0]
 [0 0 0 1 0 0 0 0]
 [0 0 0 0 0 1 0 0]]
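
As an alternative sketch, the same one-hot rows can be produced in a single step by indexing an identity matrix with the label ids. `one_hot` is a hypothetical helper name, and the label values are made up.

```python
import numpy as np

def one_hot(label_ids, num_classes):
    """Row i of the identity matrix is the one-hot vector for class i."""
    return np.eye(num_classes, dtype=int)[label_ids]

encoded = one_hot([0, 3, 5], 8)
print(encoded.shape)  # (3, 8)
print(encoded[1])     # [0 0 0 1 0 0 0 0]
```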


In [65]:
def embedded_with_class(model, doc, size):
    doc_x = embed_corpus(model, doc)
    doc_y = encode_class(doc, size)
    return doc_x, doc_y

In [66]:
print(embedded_with_class(model, train_doc[:5], 8))

(array([[ 0.61317599,  0.13318765,  0.877092  ,  0.30233732,  0.50376517,
         0.04399596, -0.08663318, -0.29708579,  0.69379228,  0.10649608,
        -0.01964743,  0.54565632,  0.36889261, -0.32808724,  0.20847213,
        -0.24323185,  0.07905903, -0.09735632, -0.1074714 , -0.13894127,
         0.41548389,  0.45580539,  0.04107737,  0.11766125, -0.16154069,
         0.333588  , -0.02465178,  0.18323582,  0.3343983 ,  0.4087621 ,
        -0.1403546 ,  0.34203297, -0.16795711,  0.20591733,  0.14874336,
        -0.00501728, -0.34611276, -0.30757073,  0.300657  ,  0.66527104,
         0.43257919,  0.35578319,  0.13519806,  0.0440546 ,  0.11838242,
        -0.35892734,  0.13887228, -0.3578656 ,  0.26338345,  0.67413014,
         0.27571344,  0.07908486, -0.19135548,  0.11566199, -0.47183725,
        -0.15311864, -0.57623869, -0.2464835 , -0.13866121,  0.28344458,
         0.38064057,  0.271348  , -0.4831897 ,  0.31397161, -0.31131256,
        -0.74351436,  0.18034452,  0.20382388, -0.

In [67]:
train_doc_embed_with_class = embedded_with_class(model, train_doc, len(label_dict))
print(train_doc_embed_with_class[:5])

(array([[ 0.61317599,  0.13318765,  0.877092  , ..., -0.16908757,
         0.64099133, -0.4423787 ],
       [ 0.4802019 ,  0.07306197,  0.76076442, ..., -0.23957846,
         0.57708287, -0.41351196],
       [ 0.56784266, -0.03369169,  0.87067646, ..., -0.21189107,
         0.62699628, -0.41352344],
       ..., 
       [ 0.56456327,  0.00430289,  1.02511215, ..., -0.39203775,
         0.66593307, -0.41639647],
       [ 0.59195858,  0.09241927,  0.98085636, ..., -0.22513945,
         0.75796807, -0.4565427 ],
       [ 0.5245719 ,  0.01454974,  0.80657387, ..., -0.32598445,
         0.69195735, -0.32277915]], dtype=float32), array([[1, 0, 0, ..., 0, 0, 0],
       [1, 0, 0, ..., 0, 0, 0],
       [1, 0, 0, ..., 0, 0, 0],
       ..., 
       [0, 1, 0, ..., 0, 0, 0],
       [1, 0, 0, ..., 0, 0, 0],
       [0, 1, 0, ..., 0, 0, 0]]))


In [68]:
print(len(label_dict))
print(train_doc_embed_with_class[0].shape)
print(train_doc_embed_with_class[1].shape)

8
(1579, 100)
(1579, 8)


## TensorFlow model 

In [69]:
import tensorflow as tf 

In [93]:
epoch = 2000
learning_rate = 0.1
batch_size = 50
total_batch = train_doc_embedded.shape[0] // batch_size
print(total_batch)
index = 0

31


In [139]:
x = tf.placeholder(tf.float32, shape=[None, 100])
y = tf.placeholder(tf.int64, shape=[None, 8])

In [140]:
W = tf.Variable(tf.truncated_normal(shape=[100, 256]))
b = tf.Variable(tf.constant(0.0, shape=[256]))

V = tf.Variable(tf.truncated_normal(shape=[256, 8]))
c = tf.Variable(tf.constant(0.0, shape=[8]))

In [141]:
h = tf.tanh(tf.matmul(x, W) + b)
u = tf.matmul(h, V) + c

In [142]:
print(h.shape)
print(u.shape)

(?, 256)
(?, 8)


In [143]:
p = tf.nn.softmax(u)
print(p.shape)

(?, 8)
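
The graph above is a single tanh hidden layer followed by a linear layer and a softmax: h = tanh(xW + b), u = hV + c, p = softmax(u). A NumPy sketch of the same forward pass, with random weights and a made-up batch of 4 inputs, makes the shapes concrete. The `_np` names are hypothetical and chosen so that they do not clash with the TensorFlow variables.

```python
import numpy as np

rng = np.random.default_rng(0)

x_np = rng.normal(size=(4, 100))     # batch of 4 document vectors
W_np = rng.normal(size=(100, 256))   # input-to-hidden weights
b_np = np.zeros(256)
V_np = rng.normal(size=(256, 8))     # hidden-to-output weights
c_np = np.zeros(8)

h_np = np.tanh(x_np @ W_np + b_np)
u_np = h_np @ V_np + c_np

# Subtract the row maximum before exponentiating for numerical stability
exp_u = np.exp(u_np - u_np.max(axis=1, keepdims=True))
p_np = exp_u / exp_u.sum(axis=1, keepdims=True)

print(h_np.shape, u_np.shape, p_np.shape)  # (4, 256) (4, 8) (4, 8)
print(p_np.sum(axis=1))                    # each row sums to 1
```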


In [144]:
pred = tf.argmax(p, 1)

In [145]:
print(pred.shape)

(?,)


In [146]:
loss = tf.reduce_mean(tf.reduce_sum(-tf.cast(y, tf.float32)*tf.log(tf.clip_by_value(p, 1e-10, 1.0)), 1))
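
The loss is the batch mean of the cross-entropy, with the probabilities clipped to [1e-10, 1] so that the logarithm never receives zero. One consequence is that an example whose true-class probability has been clipped contributes -log(1e-10), about 23.03, to the sum, or about 0.4605 to the mean of a batch of 50. Many of the losses logged during training (2.7631, 8.74982, 11.5129) are whole multiples of 0.4605, which is consistent with the softmax saturating. A pure-Python sketch with made-up values (`clipped_cross_entropy` is a hypothetical helper):

```python
import math

def clipped_cross_entropy(y_true, p_pred, eps=1e-10):
    """Mean over examples of -sum(y * log(clip(p, eps, 1))) for one-hot y."""
    total = 0.0
    for y_row, p_row in zip(y_true, p_pred):
        total += -sum(y * math.log(min(max(p, eps), 1.0))
                      for y, p in zip(y_row, p_row))
    return total / len(y_true)

y_toy = [[1, 0], [0, 1]]
p_toy = [[1.0, 0.0], [1.0, 0.0]]   # second example is confidently wrong

toy_loss = clipped_cross_entropy(y_toy, p_toy)
per_example = -math.log(1e-10) / 50

print(toy_loss)     # (0 + 23.0259) / 2, about 11.5129
print(per_example)  # about 0.4605
```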

In [147]:
# loss = tf.reduce_mean(tf.reduce_sum(-tf.cast(y, tf.float32)*tf.log(p), 1))

In [148]:
optimizer = tf.train.AdamOptimizer(learning_rate).minimize(loss)

In [149]:
def next_batch(data, index, size):
    """Return (next index, x batch, y batch), wrapping around at the end of the data."""
    n = data[0].shape[0]
    end = index + size
    if end <= n:
        return end, data[0][index:end], data[1][index:end]
    # Wrap around: take the tail of the data, then continue from the start
    wrapped = end - n
    x_batch = np.concatenate((data[0][index:], data[0][:wrapped]), 0)
    y_batch = np.concatenate((data[1][index:], data[1][:wrapped]), 0)
    return wrapped, x_batch, y_batch

In [150]:
k, t, q = next_batch(train_doc_embed_with_class, 1500, 200)
print(k)
print(t.shape)
print(q.shape)

121
(200, 100)
(200, 8)
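
The wrap-around case is easier to follow on a tiny array. The sketch below uses 5 made-up rows and a batch size of 3, with a compact variant based on modular indexing. `next_batch_wrapped` is a hypothetical re-implementation, not the notebook's function. For batch sizes no larger than the data it should select the same rows, although it returns index 0 rather than n when a batch ends exactly at the last row.

```python
import numpy as np

def next_batch_wrapped(data, index, size):
    """Hypothetical variant of next_batch using modular indexing."""
    n = data[0].shape[0]
    idx = (index + np.arange(size)) % n   # row indices, wrapping past the end
    return (index + size) % n, data[0][idx], data[1][idx]

xs = np.arange(5).reshape(5, 1)   # toy inputs: rows 0..4
ys = np.arange(5) * 10            # toy targets: 0, 10, ..., 40

i1, bx1, by1 = next_batch_wrapped((xs, ys), 0, 3)
i2, bx2, by2 = next_batch_wrapped((xs, ys), i1, 3)

print(i1, bx1[:, 0], by1)  # 3 [0 1 2] [ 0 10 20]
print(i2, bx2[:, 0], by2)  # 1 [3 4 0] [30 40  0]
```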


In [151]:
sess = tf.InteractiveSession()

In [154]:
init = tf.global_variables_initializer()
sess.run(init)

In [155]:
for i in range(epoch):
    xloss = 0
    
    for j in range(total_batch):
        # next_batch returns the inputs together with their one-hot labels (8 classes)
        index, x_, y_ = next_batch(train_doc_embed_with_class, index, batch_size)
#         print(x_.shape, y_.shape)
#         print(type(x_), type(y_))
        _, xloss = sess.run([optimizer, loss], feed_dict={x: x_, y: y_})
        
        if j % 10 == 0:
            print("epoch %d, run %d, loss %g" % (i, j, xloss))
            

epoch 0, run 0, loss 16.8623
epoch 0, run 10, loss 2.7631
epoch 0, run 20, loss 8.74982
epoch 0, run 30, loss 10.1314
epoch 1, run 0, loss 12.434
epoch 1, run 10, loss 3.68414
epoch 1, run 20, loss 8.74982
epoch 1, run 30, loss 11.0524
epoch 2, run 0, loss 11.5129
epoch 2, run 10, loss 7.36827
epoch 2, run 20, loss 8.28931
epoch 2, run 30, loss 12.434
epoch 3, run 0, loss 10.5919
epoch 3, run 10, loss 9.67086
epoch 3, run 20, loss 10.5919
epoch 3, run 30, loss 12.8945
epoch 4, run 0, loss 11.5129
epoch 4, run 10, loss 8.74982
epoch 4, run 20, loss 9.67086
epoch 4, run 30, loss 10.1314
epoch 5, run 0, loss 13.355
epoch 5, run 10, loss 10.1314
epoch 5, run 20, loss 7.36827
epoch 5, run 30, loss 10.5919
epoch 6, run 0, loss 12.8945
epoch 6, run 10, loss 13.8155
epoch 6, run 20, loss 4.60517
epoch 6, run 30, loss 11.5129
epoch 7, run 0, loss 8.74982
epoch 7, run 10, loss 11.5129
epoch 7, run 20, loss 5.5262
epoch 7, run 30, loss 10.1314
epoch 8, run 0, loss 11.5129
epoch 8, run 10, loss 10

epoch 74, run 20, loss 9.21034
epoch 74, run 30, loss 8.74982
epoch 75, run 0, loss 8.74982
epoch 75, run 10, loss 10.5919
epoch 75, run 20, loss 10.1314
epoch 75, run 30, loss 9.67086
epoch 76, run 0, loss 7.82879
epoch 76, run 10, loss 11.5129
epoch 76, run 20, loss 8.74982
epoch 76, run 30, loss 8.74982
epoch 77, run 0, loss 9.67086
epoch 77, run 10, loss 14.7365
epoch 77, run 20, loss 11.0524
epoch 77, run 30, loss 6.44724
epoch 78, run 0, loss 8.74982
epoch 78, run 10, loss 11.9734
epoch 78, run 20, loss 13.8155
epoch 78, run 30, loss 5.06569
epoch 79, run 0, loss 8.74982
epoch 79, run 10, loss 9.21034
epoch 79, run 20, loss 11.0524
epoch 79, run 30, loss 5.06569
epoch 80, run 0, loss 5.98672
epoch 80, run 10, loss 11.5129
epoch 80, run 20, loss 11.5129
epoch 80, run 30, loss 5.98672
epoch 81, run 0, loss 5.5262
epoch 81, run 10, loss 11.5129
epoch 81, run 20, loss 11.0524
epoch 81, run 30, loss 5.5262
epoch 82, run 0, loss 4.60517
epoch 82, run 10, loss 9.21034
epoch 82, run 20, 

epoch 143, run 20, loss 10.1314
epoch 143, run 30, loss 3.68414
epoch 144, run 0, loss 6.90775
epoch 144, run 10, loss 11.5129
epoch 144, run 20, loss 11.9734
epoch 144, run 30, loss 2.7631
epoch 145, run 0, loss 5.5262
epoch 145, run 10, loss 10.1314
epoch 145, run 20, loss 13.355
epoch 145, run 30, loss 5.06569
epoch 146, run 0, loss 3.22362
epoch 146, run 10, loss 8.74982
epoch 146, run 20, loss 10.5919
epoch 146, run 30, loss 9.67086
epoch 147, run 0, loss 3.68414
epoch 147, run 10, loss 8.74982
epoch 147, run 20, loss 10.1314
epoch 147, run 30, loss 9.67086
epoch 148, run 0, loss 6.90775
epoch 148, run 10, loss 9.21034
epoch 148, run 20, loss 11.9734
epoch 148, run 30, loss 8.74982
epoch 149, run 0, loss 9.67086
epoch 149, run 10, loss 9.21034
epoch 149, run 20, loss 13.355
epoch 149, run 30, loss 11.9734
epoch 150, run 0, loss 9.21034
epoch 150, run 10, loss 8.74982
epoch 150, run 20, loss 10.1314
epoch 150, run 30, loss 13.355
epoch 151, run 0, loss 9.21034
epoch 151, run 10, lo

epoch 211, run 0, loss 8.28931
epoch 211, run 10, loss 4.14465
epoch 211, run 20, loss 7.82879
epoch 211, run 30, loss 11.0524
epoch 212, run 0, loss 10.1314
epoch 212, run 10, loss 4.14465
epoch 212, run 20, loss 8.28931
epoch 212, run 30, loss 10.1314
epoch 213, run 0, loss 11.9734
epoch 213, run 10, loss 3.68414
epoch 213, run 20, loss 6.44724
epoch 213, run 30, loss 10.1314
epoch 214, run 0, loss 11.5129
epoch 214, run 10, loss 7.36827
epoch 214, run 20, loss 6.44724
epoch 214, run 30, loss 10.5919
epoch 215, run 0, loss 10.1314
epoch 215, run 10, loss 8.74982
epoch 215, run 20, loss 8.28931
epoch 215, run 30, loss 9.21034
epoch 216, run 0, loss 10.5919
epoch 216, run 10, loss 5.5262
epoch 216, run 20, loss 11.5129
epoch 216, run 30, loss 12.8945
epoch 217, run 0, loss 9.67086
epoch 217, run 10, loss 3.68414
epoch 217, run 20, loss 9.21034
epoch 217, run 30, loss 11.9734
epoch 218, run 0, loss 10.1314
epoch 218, run 10, loss 2.30258
epoch 218, run 20, loss 8.28931
epoch 218, run 30

epoch 285, run 10, loss 11.9734
epoch 285, run 20, loss 4.14465
epoch 285, run 30, loss 5.98672
epoch 286, run 0, loss 9.67086
epoch 286, run 10, loss 11.9734
epoch 286, run 20, loss 8.28931
epoch 286, run 30, loss 6.90775
epoch 287, run 0, loss 7.36827
epoch 287, run 10, loss 10.5919
epoch 287, run 20, loss 8.28931
epoch 287, run 30, loss 9.21034
epoch 288, run 0, loss 6.90775
epoch 288, run 10, loss 11.5129
epoch 288, run 20, loss 5.98672
epoch 288, run 30, loss 11.5129
epoch 289, run 0, loss 6.90775
epoch 289, run 10, loss 10.5919
epoch 289, run 20, loss 4.14465
epoch 289, run 30, loss 7.82879
epoch 290, run 0, loss 11.5129
epoch 290, run 10, loss 11.0524
epoch 290, run 20, loss 2.7631
epoch 290, run 30, loss 8.74982
epoch 291, run 0, loss 10.5919
epoch 291, run 10, loss 13.355
epoch 291, run 20, loss 4.14465
epoch 291, run 30, loss 7.36827
epoch 292, run 0, loss 9.21034
epoch 292, run 10, loss 10.1314
epoch 292, run 20, loss 9.21034
epoch 292, run 30, loss 9.67086
epoch 293, run 0,

epoch 353, run 30, loss 5.98672
epoch 354, run 0, loss 5.06569
epoch 354, run 10, loss 9.21034
epoch 354, run 20, loss 9.21034
epoch 354, run 30, loss 4.14465
epoch 355, run 0, loss 6.44724
epoch 355, run 10, loss 11.0524
epoch 355, run 20, loss 9.67086
epoch 355, run 30, loss 5.06569
epoch 356, run 0, loss 5.98672
epoch 356, run 10, loss 11.0524
epoch 356, run 20, loss 11.5129
epoch 356, run 30, loss 4.60517
epoch 357, run 0, loss 3.68414
epoch 357, run 10, loss 8.28931
epoch 357, run 20, loss 11.0524
epoch 357, run 30, loss 5.98672
epoch 358, run 0, loss 4.14465
epoch 358, run 10, loss 8.28931
epoch 358, run 20, loss 11.5129
epoch 358, run 30, loss 8.74982
epoch 359, run 0, loss 3.22362
epoch 359, run 10, loss 5.98672
epoch 359, run 20, loss 10.5919
epoch 359, run 30, loss 7.36827
epoch 360, run 0, loss 6.90775
epoch 360, run 10, loss 5.98672
epoch 360, run 20, loss 11.0524
epoch 360, run 30, loss 5.5262
epoch 361, run 0, loss 8.74982
epoch 361, run 10, loss 8.28931
epoch 361, run 20

epoch 420, run 20, loss 11.5129
epoch 420, run 30, loss 8.28931
epoch 421, run 0, loss 9.67086
epoch 421, run 10, loss 10.1314
epoch 421, run 20, loss 15.1971
epoch 421, run 30, loss 11.5129
epoch 422, run 0, loss 9.67086
epoch 422, run 10, loss 8.74982
epoch 422, run 20, loss 11.5129
epoch 422, run 30, loss 13.8155
epoch 423, run 0, loss 9.21034
epoch 423, run 10, loss 8.28931
epoch 423, run 20, loss 9.21034
epoch 423, run 30, loss 10.1314
epoch 424, run 0, loss 12.434
epoch 424, run 10, loss 5.5262
epoch 424, run 20, loss 12.434
epoch 424, run 30, loss 12.434
epoch 425, run 0, loss 12.8945
epoch 425, run 10, loss 5.5262
epoch 425, run 20, loss 11.5129
epoch 425, run 30, loss 10.5919
epoch 426, run 0, loss 10.1314
epoch 426, run 10, loss 4.60517
epoch 426, run 20, loss 9.21034
epoch 426, run 30, loss 8.28931
epoch 427, run 0, loss 12.8945
epoch 427, run 10, loss 6.44724
epoch 427, run 20, loss 11.9734
epoch 427, run 30, loss 10.5919
epoch 428, run 0, loss 9.21034
epoch 428, run 10, lo

epoch 492, run 30, loss 12.434
epoch 493, run 0, loss 10.5919
epoch 493, run 10, loss 9.67086
epoch 493, run 20, loss 10.1314
epoch 493, run 30, loss 13.355
epoch 494, run 0, loss 11.0524
epoch 494, run 10, loss 8.74982
epoch 494, run 20, loss 9.67086
epoch 494, run 30, loss 10.1314
epoch 495, run 0, loss 13.355
epoch 495, run 10, loss 10.1314
epoch 495, run 20, loss 7.36827
epoch 495, run 30, loss 10.5919
epoch 496, run 0, loss 12.434
epoch 496, run 10, loss 13.355
epoch 496, run 20, loss 5.06569
epoch 496, run 30, loss 11.5129
epoch 497, run 0, loss 8.74982
epoch 497, run 10, loss 11.5129
epoch 497, run 20, loss 5.5262
epoch 497, run 30, loss 10.1314
epoch 498, run 0, loss 11.0524
epoch 498, run 10, loss 10.5919
epoch 498, run 20, loss 4.14465
epoch 498, run 30, loss 9.67086
epoch 499, run 0, loss 12.434
epoch 499, run 10, loss 11.5129
epoch 499, run 20, loss 5.98672
epoch 499, run 30, loss 11.5129
epoch 500, run 0, loss 9.21034
epoch 500, run 10, loss 9.21034
epoch 500, run 20, loss

epoch 560, run 10, loss 11.5129
epoch 560, run 20, loss 5.98672
epoch 560, run 30, loss 11.0524
epoch 561, run 0, loss 6.44724
epoch 561, run 10, loss 10.1314
epoch 561, run 20, loss 4.14465
epoch 561, run 30, loss 9.67086
epoch 562, run 0, loss 9.21034
epoch 562, run 10, loss 10.1314
epoch 562, run 20, loss 2.30259
epoch 562, run 30, loss 9.21034
epoch 563, run 0, loss 11.0524
epoch 563, run 10, loss 12.434
epoch 563, run 20, loss 3.68414
epoch 563, run 30, loss 7.82879
epoch 564, run 0, loss 8.28931
epoch 564, run 10, loss 11.0524
epoch 564, run 20, loss 9.21034
epoch 564, run 30, loss 8.28931
epoch 565, run 0, loss 8.74982
epoch 565, run 10, loss 10.5919
epoch 565, run 20, loss 10.5919
epoch 565, run 30, loss 9.67086
epoch 566, run 0, loss 7.36827
epoch 566, run 10, loss 11.5129
epoch 566, run 20, loss 8.28931
epoch 566, run 30, loss 8.74982
epoch 567, run 0, loss 10.1314
epoch 567, run 10, loss 14.7365
epoch 567, run 20, loss 11.0524
epoch 567, run 30, loss 6.90775
epoch 568, run 0

epoch 626, run 30, loss 5.06569
epoch 627, run 0, loss 5.06569
epoch 627, run 10, loss 10.1314
epoch 627, run 20, loss 9.67086
epoch 627, run 30, loss 5.06569
epoch 628, run 0, loss 5.98672
epoch 628, run 10, loss 11.5129
epoch 628, run 20, loss 10.5919
epoch 628, run 30, loss 4.60517
epoch 629, run 0, loss 3.68414
epoch 629, run 10, loss 9.67086
epoch 629, run 20, loss 11.9734
epoch 629, run 30, loss 4.14465
epoch 630, run 0, loss 5.06569
epoch 630, run 10, loss 8.74982
epoch 630, run 20, loss 11.0524
epoch 630, run 30, loss 8.28931
epoch 631, run 0, loss 4.14465
epoch 631, run 10, loss 7.36827
epoch 631, run 20, loss 10.5919
epoch 631, run 30, loss 8.28931
epoch 632, run 0, loss 5.98672
epoch 632, run 10, loss 6.90775
epoch 632, run 20, loss 11.0524
epoch 632, run 30, loss 5.98672
epoch 633, run 0, loss 8.28931
epoch 633, run 10, loss 6.90775
epoch 633, run 20, loss 10.1314
epoch 633, run 30, loss 3.68414
epoch 634, run 0, loss 7.36827
epoch 634, run 10, loss 11.9734
epoch 634, run 2

epoch 693, run 20, loss 13.8155
epoch 693, run 30, loss 10.5919
epoch 694, run 0, loss 9.67086
epoch 694, run 10, loss 9.21034
epoch 694, run 20, loss 12.8945
epoch 694, run 30, loss 13.8155
epoch 695, run 0, loss 8.74982
epoch 695, run 10, loss 8.28931
epoch 695, run 20, loss 8.28931
epoch 695, run 30, loss 11.5129
epoch 696, run 0, loss 11.0524
epoch 696, run 10, loss 5.98672
epoch 696, run 20, loss 12.434
epoch 696, run 30, loss 11.5129
epoch 697, run 0, loss 13.8155
epoch 697, run 10, loss 5.06569
epoch 697, run 20, loss 11.9734
epoch 697, run 30, loss 11.5129
epoch 698, run 0, loss 10.5919
epoch 698, run 10, loss 4.60517
epoch 698, run 20, loss 9.21034
epoch 698, run 30, loss 8.74982
epoch 699, run 0, loss 12.434
epoch 699, run 10, loss 6.90775
epoch 699, run 20, loss 10.5919
epoch 699, run 30, loss 9.67086
epoch 700, run 0, loss 10.5919
epoch 700, run 10, loss 5.98672
epoch 700, run 20, loss 11.0524
epoch 700, run 30, loss 11.9734
epoch 701, run 0, loss 8.28931
epoch 701, run 10,

epoch 761, run 20, loss 11.0524
epoch 761, run 30, loss 12.8945
epoch 762, run 0, loss 9.67086
epoch 762, run 10, loss 3.68414
epoch 762, run 20, loss 9.21034
epoch 762, run 30, loss 11.0524
epoch 763, run 0, loss 12.434
epoch 763, run 10, loss 3.22362
epoch 763, run 20, loss 8.28931
epoch 763, run 30, loss 9.67086
epoch 764, run 0, loss 12.8945
epoch 764, run 10, loss 5.5262
epoch 764, run 20, loss 8.74982
epoch 764, run 30, loss 11.5129
epoch 765, run 0, loss 10.1314
epoch 765, run 10, loss 9.21034
epoch 765, run 20, loss 9.67086
epoch 765, run 30, loss 14.276
epoch 766, run 0, loss 11.0524
epoch 766, run 10, loss 9.67086
epoch 766, run 20, loss 9.21034
epoch 766, run 30, loss 11.0524
epoch 767, run 0, loss 12.434
epoch 767, run 10, loss 9.67086
epoch 767, run 20, loss 7.36827
epoch 767, run 30, loss 9.67086
epoch 768, run 0, loss 12.8945
epoch 768, run 10, loss 11.9734
epoch 768, run 20, loss 5.5262
epoch 768, run 30, loss 11.9734
epoch 769, run 0, loss 10.1314
epoch 769, run 10, lo

KeyboardInterrupt: 
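A note on reading the log above: most of the printed losses are integer multiples of about 0.4605 (for example 2.30258, 2.7631, 8.74982). That pattern is what a saturated softmax combined with the `1e-10` clip in the loss would produce: each misclassified document contributes -log(1e-10) ≈ 23.03 and each correct one contributes about 0. The sketch below reproduces those values under the assumption that `batch_size` is 50 (its actual value is not shown in this part of the notebook); the helper names are made up for illustration.

```python
import math

CLIP_MIN = 1e-10  # lower clip bound used in the notebook's loss


def clipped_cross_entropy(true_class_probs):
    """Mean of -log(clip(p_true, CLIP_MIN, 1.0)) over a batch."""
    losses = [-math.log(min(max(p, CLIP_MIN), 1.0)) for p in true_class_probs]
    return sum(losses) / len(losses)


def saturated_batch_loss(num_wrong, batch_size=50):
    """Loss of a batch when the softmax outputs are (almost) exactly 0 or 1.

    Wrong examples give the true class probability ~0, correct ones ~1.
    batch_size=50 is an assumption, not a value read from the notebook.
    """
    probs = [0.0] * num_wrong + [1.0] * (batch_size - num_wrong)
    return clipped_cross_entropy(probs)


for num_wrong in (5, 6, 19, 50):
    print(num_wrong, round(saturated_batch_loss(num_wrong), 4))
```

Read this way, a logged loss of 8.74982 would correspond to 19 of 50 documents in that batch being misclassified, so the loss carries little more information than the batch error rate.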

In [156]:
# Fraction of documents whose predicted class matches the class in the one-hot label.
accuracy = tf.reduce_mean(tf.cast(tf.equal(pred, tf.argmax(y, 1)), tf.float32))
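To make the accuracy expression concrete, here is a small plain-Python sketch of the same computation: take the argmax of each score row, take the argmax of each one-hot label row, and average the matches. The function names and the toy scores are illustrative only and do not come from the notebook.

```python
def argmax(values):
    """Index of the largest entry (the first one on ties)."""
    return max(range(len(values)), key=lambda i: values[i])


def accuracy(scores, one_hot_labels):
    """Share of rows whose highest score lands on the labelled class."""
    hits = sum(argmax(s) == argmax(t) for s, t in zip(scores, one_hot_labels))
    return hits / len(scores)


# Toy example with 3 classes and 4 documents (made-up numbers).
scores = [[0.1, 0.7, 0.2], [0.5, 0.3, 0.2], [0.2, 0.2, 0.6], [0.9, 0.05, 0.05]]
labels = [[0, 1, 0], [0, 0, 1], [0, 0, 1], [1, 0, 0]]
print(accuracy(scores, labels))  # 0.75
```

With 8 classes, guessing uniformly at random would score about 12.5%, which is a useful baseline when reading the validation accuracies printed in this notebook.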

In [157]:
valid_doc_embed_with_class = embedded_with_class(model, valid_doc, len(label_dict))

In [158]:
print(len(valid_doc_embed_with_class))
print(valid_doc_embed_with_class[0].shape)
print(valid_doc_embed_with_class[1].shape)
print(valid_doc_embed_with_class[:2])

2
(250, 100)
(250, 8)
(array([[ 0.45717028, -0.01775449,  0.89623034, ..., -0.31501675,
         0.68082738, -0.41110277],
       [ 0.47098407, -0.00393966,  0.71919209, ..., -0.52422023,
         0.59656417, -0.30626851],
       [ 0.60405642,  0.04082353,  0.88992083, ..., -0.23265001,
         0.67125058, -0.46151358],
       ..., 
       [ 0.57207495,  0.04660853,  0.87777597, ..., -0.30794084,
         0.72478807, -0.45665005],
       [ 0.38446307,  0.1561233 ,  0.80170846, ..., -0.1633113 ,
         0.5337075 , -0.36341241],
       [ 0.55622065,  0.03490572,  0.83159536, ..., -0.35373959,
         0.63436121, -0.40071458]], dtype=float32), array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 1, 0, 0],
       ..., 
       [0, 1, 0, ..., 0, 0, 0],
       [0, 0, 1, ..., 0, 0, 0],
       [0, 1, 0, ..., 0, 0, 0]]))


In [159]:
test_doc_embed_with_class = embedded_with_class(model, test_doc, len(label_dict))

In [160]:
print(len(test_doc_embed_with_class))
print(test_doc_embed_with_class[0].shape)
print(test_doc_embed_with_class[1].shape)
print(test_doc_embed_with_class[1])

2
(249, 100)
(249, 8)
[[1 0 0 ..., 0 0 0]
 [0 0 0 ..., 0 0 1]
 [1 0 0 ..., 0 0 0]
 ..., 
 [0 1 0 ..., 0 0 0]
 [0 0 0 ..., 0 0 1]
 [1 0 0 ..., 0 0 0]]


In [161]:
x_valid, y_valid = (valid_doc_embed_with_class[0], valid_doc_embed_with_class[1])
print(x_valid.shape)
print(y_valid.shape)

(250, 100)
(250, 8)


In [162]:
print(x.shape)
print(y.shape)

(?, 100)
(?, 8)


In [163]:
acc = sess.run(accuracy, feed_dict={x:valid_doc_embed_with_class[0], y:valid_doc_embed_with_class[1]})
print(acc)

0.344


In [164]:
epoch = 6000

In [165]:
init = tf.global_variables_initializer()
sess.run(init)

In [166]:
for i in range(epoch):
    for j in range(total_batch):
        # next_batch returns the updated index, a batch of document embeddings (x_)
        # and the matching one-hot labels over the 8 classes (y_).
        index, x_, y_ = next_batch(train_doc_embed_with_class, index, batch_size)
        _, xloss = sess.run([optimizer, loss], feed_dict={x: x_, y: y_})

        # Log the loss of every 30th batch.
        if j % 30 == 0:
            print("epoch %d, run %d, loss %g" % (i, j, xloss))

    # Report validation accuracy every 100 epochs.
    if i % 100 == 0:
        acc = sess.run(accuracy, feed_dict={x: valid_doc_embed_with_class[0], y: valid_doc_embed_with_class[1]})
        print("Validation acc: %g%%" % (acc * 100))

epoch 0, run 0, loss 21.3644
epoch 0, run 30, loss 22.5653
Validation acc: 13.2%
epoch 1, run 0, loss 21.1838
epoch 1, run 30, loss 22.5653
epoch 2, run 0, loss 22.1048
epoch 2, run 30, loss 21.1838
epoch 3, run 0, loss 22.1048
epoch 3, run 30, loss 20.7233
epoch 4, run 0, loss 22.1048
epoch 4, run 30, loss 21.6443
epoch 5, run 0, loss 20.7233
epoch 5, run 30, loss 21.6443
epoch 6, run 0, loss 21.6443
epoch 6, run 30, loss 21.1838
epoch 7, run 0, loss 21.6443
epoch 7, run 30, loss 21.6443
epoch 8, run 0, loss 21.6443
epoch 8, run 30, loss 22.5653
epoch 9, run 0, loss 21.1838
epoch 9, run 30, loss 23.0259
epoch 10, run 0, loss 22.1048
epoch 10, run 30, loss 22.1048
epoch 11, run 0, loss 23.0259
epoch 11, run 30, loss 21.6443
epoch 12, run 0, loss 22.5653
epoch 12, run 30, loss 22.1048
epoch 13, run 0, loss 22.1048
epoch 13, run 30, loss 23.0259
epoch 14, run 0, loss 21.6443
epoch 14, run 30, loss 22.5653
epoch 15, run 0, loss 22.1048
epoch 15, run 30, loss 22.5653
epoch 16, run 0, loss 

epoch 141, run 0, loss 19.8022
epoch 141, run 30, loss 20.7233
epoch 142, run 0, loss 21.1838
epoch 142, run 30, loss 22.1048
epoch 143, run 0, loss 21.1838
epoch 143, run 30, loss 21.6443
epoch 144, run 0, loss 20.7233
epoch 144, run 30, loss 20.7233
epoch 145, run 0, loss 22.1048
epoch 145, run 30, loss 20.7233
epoch 146, run 0, loss 21.1838
epoch 146, run 30, loss 20.7233
epoch 147, run 0, loss 20.2627
epoch 147, run 30, loss 21.6443
epoch 148, run 0, loss 20.2627
epoch 148, run 30, loss 22.1048
epoch 149, run 0, loss 21.1838
epoch 149, run 30, loss 21.1838
epoch 150, run 0, loss 22.1048
epoch 150, run 30, loss 20.2627
epoch 151, run 0, loss 21.6443
epoch 151, run 30, loss 20.2627
epoch 152, run 0, loss 21.1838
epoch 152, run 30, loss 20.7233
epoch 153, run 0, loss 20.2627
epoch 153, run 30, loss 21.6443
epoch 154, run 0, loss 20.7233
epoch 154, run 30, loss 19.8022
epoch 155, run 0, loss 21.1838
epoch 155, run 30, loss 19.3417
epoch 156, run 0, loss 21.6443
epoch 156, run 30, loss 

KeyboardInterrupt: 
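Both training runs so far were stopped by hand. One common alternative is to stop automatically once validation accuracy has not improved for a few consecutive checks. The sketch below shows that rule on a list of validation accuracies; the function name, the `patience` parameter and the example numbers are all hypothetical and are not part of the notebook.

```python
def early_stopping_epoch(val_accuracies, patience=3):
    """Return (stop_index, best_index) for a sequence of validation checks.

    Training would stop at the first check that is `patience` checks past
    the best one seen so far; otherwise it runs to the last check.
    """
    best_acc = float("-inf")
    best_idx = -1
    for idx, acc in enumerate(val_accuracies):
        if acc > best_acc:
            best_acc, best_idx = acc, idx
        elif idx - best_idx >= patience:
            return idx, best_idx
    return len(val_accuracies) - 1, best_idx


# Made-up validation accuracies, one per check.
history = [0.10, 0.20, 0.30, 0.29, 0.28, 0.27, 0.26]
print(early_stopping_epoch(history))  # (5, 2)
```

In the loop above, a check happens every 100 epochs, so `patience=3` would mean waiting 300 epochs without improvement before stopping.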

## Questions to answer 

- Compare the learning curves of the model starting from random embeddings, starting from GloVe embeddings (http://nlp.stanford.edu/data/glove.6B.zip; 50 dimensions), or with the embeddings fixed to the GloVe values. Training in batches (e.g. of size 50) is more stable. Which model works best on training vs. test? Which model works best on held-out accuracy?
- What happens if you try alternative non-linearities (logistic sigmoid or ReLU instead of tanh)?
- What happens if you add dropout to the network?
- What happens if you vary the size of the hidden layer?
- How would the code change if you wanted to add a second hidden layer?
- How does the training algorithm affect the quality of the model?
- Project the embeddings of the labels onto 2 dimensions and visualise (each row of the projection matrix V corresponds to a label embedding). Do you see anything interesting?
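For the last question, one possible way to get a 2-D view of the label embeddings is a PCA projection. The sketch below is only an illustration under a few assumptions: it uses NumPy's SVD, a random matrix stands in for the trained weights (which could be fetched with `sess.run(V)`), and `project_to_2d` is a made-up helper. Note that in the two-hidden-layer cell of this notebook `V` has shape `[128, 8]`, so the label vectors are its columns, hence the transpose.

```python
import numpy as np


def project_to_2d(matrix):
    """Project the rows of `matrix` onto their first two principal components."""
    centered = matrix - matrix.mean(axis=0, keepdims=True)
    # Rows of vt are the principal directions, ordered by singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T


# Stand-in for trained output weights of shape [hidden, classes] = [128, 8].
rng = np.random.default_rng(0)
V_values = rng.normal(size=(128, 8))

# One row per label after the transpose, then reduce 128 dimensions to 2.
label_points = project_to_2d(V_values.T)
print(label_points.shape)  # (8, 2)
```

The resulting eight points can then be passed to any scatter plot, for example the bokeh `figure` imported at the top of the notebook, with one text label per class.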

In [170]:
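epoch = 10000

# Two ReLU hidden layers (100 -> 256 -> 128) followed by an 8-way softmax output.
W = tf.Variable(tf.truncated_normal(shape=[100, 256]))
b = tf.Variable(tf.constant(0.0, shape=[256]))

W2 = tf.Variable(tf.truncated_normal(shape=[256, 128]))
b2 = tf.Variable(tf.constant(0.0, shape=[128]))

V = tf.Variable(tf.truncated_normal(shape=[128, 8]))
c = tf.Variable(tf.constant(0.0, shape=[8]))

# Despite its name, this placeholder is passed as keep_prob: it is the probability
# of keeping a unit (0.5 during training, 1.0 for evaluation).
dropout_rate = tf.placeholder(tf.float32)

h = tf.nn.relu(tf.matmul(x, W) + b)
h2 = tf.nn.relu(tf.matmul(h, W2) + b2)
h2_drop = tf.nn.dropout(h2, keep_prob=dropout_rate)
u = tf.matmul(h2_drop, V) + c
p = tf.nn.softmax(u)
pred = tf.argmax(p, 1)
loss = tf.reduce_mean(tf.reduce_sum(-tf.cast(y, tf.float32)*tf.log(tf.clip_by_value(p, 1e-10, 1.0)), 1))

# `accuracy` was built from the previous `pred`, so rebuild it for this network.
accuracy = tf.reduce_mean(tf.cast(tf.equal(pred, tf.argmax(y, 1)), tf.float32))
# NOTE: `optimizer` was likewise built from the earlier `loss`. It has to be recreated
# on this `loss` (with whichever optimizer was configured earlier); otherwise the
# training loop keeps updating the old network and these weights never change.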
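Since the placeholder above is fed to `keep_prob`, it may help to see what that argument does. The plain-Python sketch below mimics the documented behaviour of `tf.nn.dropout` in TensorFlow 1.x: each element is kept with probability `keep_prob` and scaled by `1 / keep_prob`, otherwise it is set to 0. The function name and the toy activations are illustrative, not taken from the notebook.

```python
import random


def inverted_dropout(values, keep_prob, rng):
    """Keep each value with probability keep_prob, rescaling kept values by 1/keep_prob."""
    if not 0.0 < keep_prob <= 1.0:
        raise ValueError("keep_prob must be in (0, 1]")
    return [v / keep_prob if rng.random() < keep_prob else 0.0 for v in values]


rng = random.Random(0)
activations = [1.0] * 10000

train_out = inverted_dropout(activations, 0.5, rng)  # training: about half the units dropped
eval_out = inverted_dropout(activations, 1.0, rng)   # evaluation: nothing dropped

print(sorted(set(train_out)))            # [0.0, 2.0]
print(sum(train_out) / len(train_out))   # close to 1.0: the expected activation is preserved
print(eval_out == activations)           # True
```

Because of the rescaling, feeding 1.0 at validation time needs no further correction, which is why the training cell feeds 0.5 for updates and 1.0 when measuring accuracy.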

In [171]:
init = tf.global_variables_initializer()
sess.run(init)

for i in range(epoch):
    for j in range(total_batch):
        # next_batch returns the updated index, a batch of document embeddings (x_)
        # and the matching one-hot labels over the 8 classes (y_).
        index, x_, y_ = next_batch(train_doc_embed_with_class, index, batch_size)
        # Keep half of the hidden units while training.
        _, xloss = sess.run([optimizer, loss], feed_dict={x: x_, y: y_, dropout_rate: 0.5})

        # Log the loss of every 30th batch.
        if j % 30 == 0:
            print("epoch %d, run %d, loss %g" % (i, j, xloss))

    # Report validation accuracy every 100 epochs, with dropout switched off.
    if i % 100 == 0:
        acc = sess.run(accuracy, feed_dict={x: valid_doc_embed_with_class[0], y: valid_doc_embed_with_class[1], dropout_rate: 1.0})
        print("epoch %d, Validation acc: %g%%" % (i, acc * 100))

epoch 0, run 0, loss 22.1048
epoch 0, run 30, loss 20.7243
epoch 0, Validation acc: 34.4%
epoch 1, run 0, loss 21.1838
epoch 1, run 30, loss 22.1048
epoch 2, run 0, loss 22.414
epoch 2, run 30, loss 22.1119
epoch 3, run 0, loss 21.1843
epoch 3, run 30, loss 21.6443
epoch 4, run 0, loss 21.6443
epoch 4, run 30, loss 21.1838
epoch 5, run 0, loss 21.6443
epoch 5, run 30, loss 21.3803
epoch 6, run 0, loss 21.9246
epoch 6, run 30, loss 21.4713
epoch 7, run 0, loss 21.6443
epoch 7, run 30, loss 21.6443
epoch 8, run 0, loss 21.6443
epoch 8, run 30, loss 21.6987
epoch 9, run 0, loss 21.1838
epoch 9, run 30, loss 20.7233
epoch 10, run 0, loss 22.5653
epoch 10, run 30, loss 21.0082
epoch 11, run 0, loss 22.1048
epoch 11, run 30, loss 20.7233
epoch 12, run 0, loss 21.1838
epoch 12, run 30, loss 20.6662
epoch 13, run 0, loss 20.7233
epoch 13, run 30, loss 21.5665
epoch 14, run 0, loss 22.5653
epoch 14, run 30, loss 21.1838
epoch 15, run 0, loss 21.9418
epoch 15, run 30, loss 21.3755
epoch 16, run 

epoch 136, run 30, loss 22.4468
epoch 137, run 0, loss 20.7233
epoch 137, run 30, loss 22.1048
epoch 138, run 0, loss 22.5653
epoch 138, run 30, loss 22.5653
epoch 139, run 0, loss 20.7233
epoch 139, run 30, loss 23.0259
epoch 140, run 0, loss 21.6443
epoch 140, run 30, loss 21.6443
epoch 141, run 0, loss 23.0259
epoch 141, run 30, loss 20.7233
epoch 142, run 0, loss 19.5667
epoch 142, run 30, loss 22.1048
epoch 143, run 0, loss 23.0259
epoch 143, run 30, loss 21.1838
epoch 144, run 0, loss 21.6443
epoch 144, run 30, loss 20.7233
epoch 145, run 0, loss 20.7263
epoch 145, run 30, loss 20.9639
epoch 146, run 0, loss 21.1838
epoch 146, run 30, loss 21.3572
epoch 147, run 0, loss 21.352
epoch 147, run 30, loss 22.5653
epoch 148, run 0, loss 21.9001
epoch 148, run 30, loss 22.5653
epoch 149, run 0, loss 21.1908
epoch 149, run 30, loss 20.7233
epoch 150, run 0, loss 22.1048
epoch 150, run 30, loss 21.1838
epoch 151, run 0, loss 22.5653
epoch 151, run 30, loss 22.1048
epoch 152, run 0, loss 2

epoch 267, run 30, loss 22.4855
epoch 268, run 0, loss 22.1048
epoch 268, run 30, loss 22.5653
epoch 269, run 0, loss 21.067
epoch 269, run 30, loss 22.1017
epoch 270, run 0, loss 23.0259
epoch 270, run 30, loss 22.0562
epoch 271, run 0, loss 21.6443
epoch 271, run 30, loss 20.6421
epoch 272, run 0, loss 20.8923
epoch 272, run 30, loss 22.5653
epoch 273, run 0, loss 22.1048


KeyboardInterrupt: 
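The losses logged for this network stay between roughly 19.5 and 23.03 and never trend downward; 23.03 is about -log(1e-10), the largest value the clipped loss can return. One possible contributing factor, offered as a guess rather than a diagnosis, is the initialisation scale: `tf.truncated_normal` defaults to `stddev=1.0`, which with layers of width 100, 256 and 128 can produce very large logits and a softmax that is saturated from the first step. The NumPy sketch below compares two scales. It uses plain Gaussian weights and uniform random inputs as stand-ins for the notebook's truncated-normal weights and document embeddings, and `max_softmax_prob` is a hypothetical helper.

```python
import numpy as np


def max_softmax_prob(stddev, seed=0, batch=50):
    """Average of the largest softmax probability per row for random weights.

    Values near 1.0 mean the softmax is saturated; values near 1/8 mean it is
    close to uniform over the 8 classes.
    """
    rng = np.random.default_rng(seed)
    x = rng.uniform(-1.0, 1.0, size=(batch, 100))   # stand-in for document embeddings
    W = rng.normal(0.0, stddev, size=(100, 256))
    W2 = rng.normal(0.0, stddev, size=(256, 128))
    V = rng.normal(0.0, stddev, size=(128, 8))

    h = np.maximum(x @ W, 0.0)     # ReLU, biases start at zero
    h2 = np.maximum(h @ W2, 0.0)
    u = h2 @ V

    u = u - u.max(axis=1, keepdims=True)            # numerically stable softmax
    p = np.exp(u) / np.exp(u).sum(axis=1, keepdims=True)
    return float(p.max(axis=1).mean())


print(max_softmax_prob(1.0))    # near 1.0: saturated
print(max_softmax_prob(0.01))   # near 0.125: almost uniform
```

If that is what is happening here, passing a smaller `stddev` to `tf.truncated_normal` would be one thing to try; whether it actually helps on this data would have to be checked by rerunning the cell.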

## Save data sets 

In [172]:
np.savez('corpus_all_temp', train_doc_temp=train_doc_temp, valid_doc_temp=valid_doc_temp, test_doc_temp=test_doc_temp)

In [173]:
np.savez('corpus_all_47001', train_doc=train_doc, valid_doc=valid_doc, test_doc=test_doc)