### Word Vectors
Word vectors are numerical representations of the semantics of words.
Instead of training a neural network to predict the meaning of the input words directly, word2vec trains the network to predict the words that appear near a target word in unlabeled sentences.

LSA and LDiA work well for document classification, semantic search, and clustering, but the word vectors they produce are not accurate enough for semantic reasoning, or for classifying and clustering short phrases or compound words.

Two ways to train word2vec embeddings:
1. Skip-gram: predicts the context words (the output words) from a word of interest (the input word)
2. Continuous Bag of Words (CBOW): predicts the target word (the output word) from the nearby words (the input words)

### Skip-gram Approach: predict surrounding words from the given word

A traditional n-gram language model works in terms of discrete units that have no inherent relationship to one another. A continuous space model works in terms of word vectors, where similar words are likely to have similar vectors.
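
To make the skip-gram objective concrete, here is a minimal sketch of how (input word, context word) training pairs could be generated from one tokenized sentence. The function name `skipgram_pairs`, the toy sentence, and the fixed window are illustrative assumptions; real implementations typically vary the effective window size during training.

```python
def skipgram_pairs(tokens, window=2):
    """Return (input_word, context_word) pairs for one tokenized sentence."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs


# Toy sentence, assumed for illustration only.
sentence = ["the", "cat", "sat", "on", "the", "mat"]
pairs = skipgram_pairs(sentence, window=1)
print(pairs[:4])
# [('the', 'cat'), ('cat', 'the'), ('cat', 'sat'), ('sat', 'cat')]
```

Each pair becomes one training example: the first word is fed in as the input, and the network is asked to assign high probability to the second.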

Input: one-hot encoding, dimension: vocabulary size
Hidden: a single linear projection layer (no activation function), dimension: embedding size
Output: softmax, dimension: vocabulary size
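
The three layers can be sketched as a single forward pass. This is a rough illustration rather than the actual word2vec code: it assumes the hidden layer is a plain linear projection (as in the original word2vec architecture), and the toy vocabulary, the sizes `V` and `N`, and the weight names `W_in` and `W_out` are made up for the example.

```python
import math
import random

random.seed(0)

vocab = ["the", "cat", "sat", "on", "mat"]  # toy vocabulary (assumed)
V, N = len(vocab), 3                         # vocabulary size, embedding size

# Input-to-hidden weights: row i is the word vector of vocab[i].
W_in = [[random.uniform(-0.5, 0.5) for _ in range(N)] for _ in range(V)]
# Hidden-to-output weights.
W_out = [[random.uniform(-0.5, 0.5) for _ in range(V)] for _ in range(N)]


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def forward(word):
    idx = vocab.index(word)
    one_hot = [1.0 if i == idx else 0.0 for i in range(V)]
    # Linear hidden layer: a one-hot input simply selects one row of W_in.
    hidden = [sum(one_hot[i] * W_in[i][k] for i in range(V)) for k in range(N)]
    scores = [sum(hidden[k] * W_out[k][j] for k in range(N)) for j in range(V)]
    return hidden, softmax(scores)


hidden, probs = forward("cat")
print(hidden == W_in[vocab.index("cat")])  # True: the hidden activation is the word vector
```

After training, the rows of `W_in` are what get used as the word vectors; the softmax output is only needed to compute the training loss.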

### Continuous Bag of Words Approach: predict the central word from surrounding words
The bag of context words changes as the window slides over the document.
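
The sliding window can be sketched as follows. Each position in the sentence yields one training example, made of a bag of context words and the central word to predict. The function name `cbow_examples` and the toy sentence are assumptions for illustration.

```python
def cbow_examples(tokens, window=2):
    """Return (context_words, target_word) examples for one tokenized sentence."""
    examples = []
    for i, target in enumerate(tokens):
        # Up to `window` words on each side of the target, excluding the target.
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + window + 1]
        if context:
            examples.append((context, target))
    return examples


# Toy sentence, assumed for illustration only.
sentence = ["the", "cat", "sat", "on", "the", "mat"]
examples = cbow_examples(sentence, window=2)
print(examples[2])
# (['the', 'cat', 'on', 'the'], 'sat')
```

In CBOW the vectors of the context words are combined (typically averaged) into one hidden activation, which is why word order inside the window does not matter.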

#### Skip-gram works well for small corpora and rare terms, while continuous bag of words works well for frequent words and is faster to train

Improvements on top of word2vec:
1. Frequent bigrams: frequently co-occurring 2-grams/3-grams are included as single tokens in the word2vec vocabulary
2. Subsampling frequent tokens: common words like 'a' and 'the' that don't carry significant information are sampled less often
3. Negative sampling: if you train your word model on a small corpus, you might want to use 5-20 negative samples. For larger corpora, you can reduce this to as few as 2-5 samples, as suggested by Mikolov and his team.
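
As a rough illustration of item 2, the word2vec paper by Mikolov et al. discards each occurrence of a word with probability 1 - sqrt(t / f), where f is the word's relative frequency in the corpus and t is a small threshold. The sketch below assumes that formula. The function name and the example frequencies are made up, and library implementations may use a slightly different expression.

```python
import math


def discard_probability(rel_freq, t=1e-5):
    """Probability of dropping one occurrence of a word with relative frequency rel_freq."""
    # Words rarer than the threshold are never discarded (probability clamped at 0).
    return max(0.0, 1.0 - math.sqrt(t / rel_freq))


# Hypothetical relative frequencies, for illustration only.
p_common = discard_probability(0.05)   # a very common word such as 'the'
p_rare = discard_probability(1e-6)     # a rare domain term

print(round(p_common, 3))  # 0.986
print(p_rare)              # 0.0
```

Very frequent words are dropped most of the time, while rare words are always kept, so training spends more of its updates on informative words.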

In [10]:
from gensim.models.keyedvectors import KeyedVectors
from gensim.test.utils import datapath

In [18]:
word_vectors = KeyedVectors.load_word2vec_format('/Users/sli/Projects/data/GoogleNews-vectors-negative300.bin', binary=True, 
                                                limit=200000)

In [22]:
word_vectors.most_similar(positive=['cooking', 'potatoes'], topn=5)

[('cook', 0.6973531246185303),
 ('sweet_potatoes', 0.6600280404090881),
 ('vegetables', 0.6513738632202148),
 ('onions', 0.6512383818626404),
 ('baking', 0.6481684446334839)]

In [34]:
word_vectors.most_similar(positive=['harm'], topn=10)

[('harms', 0.7176095247268677),
 ('harming', 0.6956215500831604),
 ('endanger', 0.6636940836906433),
 ('irreparable_damage', 0.6374310851097107),
 ('harmed', 0.6304865479469299),
 ('injure', 0.5920926928520203),
 ('harmful', 0.5859167575836182),
 ('damage', 0.5471370220184326),
 ('detrimental', 0.5405454635620117),
 ('irreparable_harm', 0.5390825867652893)]

In [36]:
word_vectors.doesnt_match("potatoes milk cake computer".split())

'computer'

In [37]:
word_vectors.most_similar(positive=['king', 'woman'], negative=['man'], topn=2)

[('queen', 0.7118192911148071), ('monarch', 0.6189674735069275)]
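
The `positive`/`negative` query above amounts to vector arithmetic followed by a nearest-neighbor search under cosine similarity. Here is a minimal sketch of that idea on made-up 3-dimensional vectors. The `toy` dictionary, its values, and the `analogy` helper are assumptions for illustration, and gensim's own implementation differs in details such as vector normalization.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Made-up vectors; the axes loosely mean "royalty", "male", "female".
toy = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.0, 0.2, 0.2],
}


def analogy(positive, negative, vectors):
    """Return the (word, similarity) closest to sum(positive) - sum(negative)."""
    dim = len(next(iter(vectors.values())))
    query = [0.0] * dim
    for w in positive:
        query = [q + x for q, x in zip(query, vectors[w])]
    for w in negative:
        query = [q - x for q, x in zip(query, vectors[w])]
    exclude = set(positive) | set(negative)  # query words cannot be answers
    candidates = [(w, cosine(query, v)) for w, v in vectors.items() if w not in exclude]
    return max(candidates, key=lambda pair: pair[1])


best_word, best_score = analogy(["king", "woman"], ["man"], toy)
print(best_word)  # queen
```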

In [39]:
word_vectors['phone'].shape

(300,)

### Generate a new Word vector representation

TF-IDF and word-count representations mostly discard the concept of a sentence. For training word2vec, however, sentence boundaries are very important, because context windows are drawn within sentences. Therefore, the first step in creating domain-specific word vectors is to convert the document or documents into a list of sentences, each represented as a list of tokens.

In [43]:
token_list =[
  ['to', 'provide', 'early', 'intervention/early', 'childhood', 'special', 'education',
   'services', 'to', 'eligible', 'children', 'and', 'their', 'families'],
  ['essential', 'job', 'functions'],
  ['participate', 'as', 'a', 'transdisciplinary', 'team', 'member', 'to', 'complete',
   'educational', 'assessments', 'for']
]
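
A nested list like this could be built from raw text with a simple sentence splitter and tokenizer. The sketch below uses only regular expressions. The splitting rules, the function name `to_token_lists`, and the sample text are assumptions, and a dedicated NLP tokenizer would handle edge cases such as abbreviations better.

```python
import re


def to_token_lists(text):
    """Split raw text into sentences, then each sentence into lowercase tokens."""
    # Break after sentence-ending punctuation or at line breaks.
    sentences = re.split(r"(?<=[.!?])\s+|\n+", text.strip())
    token_lists = []
    for sentence in sentences:
        # Keep internal '/', "'" and '-' so 'intervention/early' stays one token.
        tokens = re.findall(r"[a-z0-9]+(?:[/'-][a-z0-9]+)*", sentence.lower())
        if tokens:
            token_lists.append(tokens)
    return token_lists


# Sample text, assumed for illustration only.
text = (
    "To provide early intervention/early childhood services. "
    "Essential job functions\n"
    "Participate as a team member."
)
result = to_token_lists(text)
print(result[1])
# ['essential', 'job', 'functions']
```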

In [45]:
from gensim.models.word2vec import Word2Vec

In [46]:
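num_features = 300    # dimensionality of the word vectors
min_word_count = 3    # ignore words that occur fewer than 3 times in the corpus
num_workers = 2       # number of worker threads used for training
window_size = 6       # maximum distance between the target word and a context word
subsampling = 1e-3    # threshold for downsampling high-frequency words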

In [49]:
model = Word2Vec(
    token_list,
    workers=num_workers,
    vector_size=num_features,  # this parameter is named `size` in gensim < 4.0
    min_count=min_word_count,
    window=window_size,
    sample=subsampling)
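
One thing worth checking before training is how much of the vocabulary survives the `min_count` filter. The sketch below counts tokens over only the three sample sentences shown earlier (the real corpus is presumably much larger) and applies the same threshold of 3. It is an illustration of the filtering idea, not gensim's internal code.

```python
from collections import Counter

# The three sample sentences from the token list above.
token_list = [
    ['to', 'provide', 'early', 'intervention/early', 'childhood', 'special', 'education',
     'services', 'to', 'eligible', 'children', 'and', 'their', 'families'],
    ['essential', 'job', 'functions'],
    ['participate', 'as', 'a', 'transdisciplinary', 'team', 'member', 'to', 'complete',
     'educational', 'assessments', 'for'],
]
min_word_count = 3

counts = Counter(token for sentence in token_list for token in sentence)
kept = sorted(word for word, n in counts.items() if n >= min_word_count)
print(kept)
# ['to']
```

On a sample this small only 'to' would receive a vector, so for quick experiments on a handful of sentences a lower threshold such as `min_count=1` may be more useful.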