# Word Representations

## *"I know words. I have the best words!"*
    - Noam Chomsky

# Dense Distributed Representations

In [1]:
import pandas as pd
df = pd.read_csv('../data/reviews.full.tsv', sep='\t', nrows=100000)
documents = df.text.tolist()
print(documents[:2])

["Prices change daily and if you want to really research the price continually at many different sites , I have found cheaper cars elsewhere . However , if you don ' t have a lot of time to research the price , this site has always been among the top three ( e . g ., cheapest ) of the ten sites I use to reserve a car .", 'and the fact that they will match other companies is awesome !!']


list

## Word embeddings with `Word2vec`

### `Word2Vec` parameters:

- **size** (int, optional) – Dimensionality of the word vectors.

- **window** (int, optional) – Maximum distance between the current and predicted word within a sentence.

- **sample** (float, optional) – The threshold for configuring which higher-frequency words are randomly downsampled, useful range is (0, 1e-5).

- **iter** (int, optional) – Number of iterations (epochs) over the corpus.

- **negative** (int, optional) – If > 0, negative sampling will be used, the int for negative specifies how many “noise words” should be drawn (usually between 5-20). If set to 0, no negative sampling is used.

- **min_count** (int, optional) – Ignores all words with total frequency lower than this.

- **hs** ({0, 1}, optional) – If 1, hierarchical softmax will be used for model training. If 0, and negative is non-zero, negative sampling will be used.

- **workers** (int, optional) – Use these many worker threads to train the model (=faster training with multicore machines).



In [2]:
from gensim.models import Word2Vec
from gensim.models.word2vec import FAST_VERSION

corpus = [document.split() for document in documents]

# initialize model
w2v_model = Word2Vec(size=100, # vector size
                     window=15, # window for sampling
                     sample=0.0001, # subsampling rate
                     iter=200, # iterations
                     negative=5, # negative samples
                     min_count=100, # minimum threshold
                     workers=-1, # parallelize to all cores
                     hs=0 # no hierarchical softmax
)

# build the vocabulary
w2v_model.build_vocab(corpus)

# train the model
w2v_model.train(corpus, 
                total_examples=w2v_model.corpus_count, 
                epochs=w2v_model.epochs)


(0, 0)

Now, we can use the embeddings of the model via the `wv` (word vector) property. Each word maps to its vector

In [3]:
w2v_model.wv['delivery']

array([ 4.3967641e-03,  3.1956586e-03,  4.1554654e-03, -3.6858404e-03,
        2.0383890e-03, -3.3349530e-03,  3.9955527e-03, -3.2048434e-04,
        3.1501739e-03,  1.7654008e-03, -3.3951167e-03,  1.7206426e-03,
        4.4660675e-03, -1.4680600e-03, -1.8679150e-03, -3.3995991e-03,
        2.8441187e-03, -4.6684514e-03, -7.2946388e-04,  4.6961126e-03,
        3.0271111e-03,  2.5028719e-03,  1.1871291e-03,  3.2459085e-03,
       -4.2324271e-03,  1.1746244e-03,  3.4704435e-04, -3.7060806e-03,
        9.2344050e-04,  2.0046136e-03,  2.4872362e-03, -3.3609455e-03,
       -3.0028156e-03,  8.7623222e-04, -5.6574220e-04, -8.2645420e-04,
        2.1637853e-03,  4.0302886e-05, -3.4664737e-03, -4.2506945e-03,
       -1.6045283e-03,  2.6106955e-03, -3.7266579e-03, -2.0738917e-03,
        6.9069810e-04,  2.6558433e-03, -3.1873595e-03, -3.6215123e-03,
        2.0043796e-03, -3.2183380e-04,  7.3239760e-04, -3.7281048e-03,
       -2.6600130e-03, -3.5996456e-03,  6.6113472e-04,  2.3121831e-03,
      

We can find the most similar words:

In [4]:
w2v_model.wv.most_similar(['delivery'])

[('Such', 0.37224239110946655),
 ('join', 0.3092310428619385),
 ('building', 0.2821229100227356),
 ('Needed', 0.2772734761238098),
 ('mails', 0.2770233750343323),
 ('messed', 0.2687864303588867),
 ('almost', 0.26552754640579224),
 ('bars', 0.26520830392837524),
 ('individual', 0.26351213455200195),
 ('trust', 0.26270318031311035)]

We can also solve the analogy tasks by defining positive and negative words

In [5]:
# birthday - present + husband => birthday:present as husband:?
w2v_model.wv.most_similar(positive=['birthday', 'husband'], negative=['present'], topn=3)

[('jewelry', 0.4413125514984131),
 ('finding', 0.33475425839424133),
 ('beautiful', 0.32356536388397217)]

In [6]:
word1 = "Cheapest"
word2 = "friendly"

# retrieve the actual vector
# print(w2v_model.wv[word1])

# compare
print(w2v_model.wv.similarity(word1, word2))

# get the 3 most similar words
print(w2v_model.wv.most_similar(word1, topn=3))


-0.041342914
[('speed', 0.3675766587257385), ('She', 0.33542460203170776), ('hoping', 0.31432610750198364)]


## Document embeddings with `Doc2Vec`

In [7]:
df.head()

Unnamed: 0,score,category,uid,gender,age,text
0,5,Car Rental,899881,F,50,Prices change daily and if you want to really ...
1,5,Fitness & Nutrition,828184,M,32,and the fact that they will match other compan...
2,5,Electronic Payment,1698375,M,48,Used Paypal for my buying and selling for the ...
3,5,Gaming,3324079,M,29,I ' ve made two purchases on CJ ' s for Fallou...
4,4,Jewelry,719816,F,29,I was very happy with the diamond that I order...


### `Doc2Vec` parameters:

- **vector_size** (int, optional) – Dimensionality of the feature vectors.

- **window** (int, optional) – Maximum distance between the current and predicted word within a sentence.

- **hs** ({0, 1}, optional) – If 1, hierarchical softmax will be used for model training. If 0, and negative is non-zero, negative sampling will be used.

- **sample** (float, optional) – The threshold for configuring which higher-frequency words are randomly downsampled, useful range is (0, 1e-5).

- **negative** (int, optional) – If > 0, negative sampling will be used, the int for negative specifies how many “noise words” should be drawn (usually between 5-20). If set to 0, no negative sampling is used.

- **min_count** (int, optional) – Ignores all words with total frequency lower than this.

- **workers** (int, optional) – Use these many worker threads to train the model (=faster training with multicore machines).

- **epochs** (int, optional) – Number of iterations (epochs) over the corpus.

- **dm** ({1,0}, optional) – Defines the training algorithm. If dm=1, ‘distributed memory’ (PV-DM) is used. Otherwise, distributed bag of words (PV-DBOW) is employed.

- **dbow_words** ({1,0}, optional) – If set to 1 trains word-vectors (in skip-gram fashion) simultaneous with DBOW doc-vector training; If 0, only trains doc-vectors (faster).



The interface for Doc2Vec is almost the same as for Word2vec. The main difference is that we need to give it a different input format, namely `TaggedDocument` objects.

In [8]:
from gensim.models import Doc2Vec
from gensim.models.doc2vec import TaggedDocument

# create a list of TaggedDocument objects
corpus = []

for row in df.iterrows():
    label = row[1].score
    text = row[1].text
    corpus.append(TaggedDocument(words=text.split(), 
                                 tags=[str(label)]))

print('done')
# initialize model
d2v_model = Doc2Vec(vector_size=100, 
                    window=15,
                    hs=0,
                    sample=0.000001,
                    negative=5,
                    min_count=100,
                    workers=-1,
                    epochs=500,
                    dm=0, 
                    dbow_words=1)

# build the vocabulary
d2v_model.build_vocab(corpus)

# train the model
d2v_model.train(corpus, total_examples=d2v_model.corpus_count, epochs=d2v_model.epochs)

done


We can now look at the elements. Doc2vec stores the words as before in `wv`, and the document representations in `docvecs`

In [9]:
d2v_model.docvecs.doctags

{'5': Doctag(offset=0, word_count=4205492, doc_count=78827),
 '4': Doctag(offset=1, word_count=604853, doc_count=9164),
 '1': Doctag(offset=2, word_count=1205430, doc_count=7316),
 '2': Doctag(offset=3, word_count=301478, doc_count=2197),
 '3': Doctag(offset=4, word_count=254820, doc_count=2496)}

We can now similarly compare document categories to each other.

In [10]:
target_doc = '1'

similar_docs = d2v_model.docvecs.most_similar(target_doc, topn=5)
print(similar_docs)

[('3', 0.0817607194185257), ('4', 0.07102358341217041), ('5', 0.030553873628377914), ('2', -0.04616723209619522)]
