### Word2Vec
**Word2Vec** is a two-layer neural net that processes text
> * Input is a text corpus
> * Output is a set of vectors (i.e. feature vectors for the words in the corpus)
> > * These vectors are distributed numerical representations of word features, such as the context in which individual words appear

**Purpose** and usefulness of Word2Vec is to group the vectors of similar words together in vector space
> * It detects similarities mathematically

**Outcome** of giving Word2Vec enough data is that it can make highly accurate guesses about a word's meaning based on its past appearances

**Trains** each word against the other words that neighbor it in the input corpus, in one of two ways:
> 1. Using the context to predict a target word - a method known as continuous bag of words (CBOW)
> 2. Using a word to predict its target context - a method known as Skip-Gram
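
The two training setups differ only in which side of the (context, target) pair is the input. The following is a minimal sketch of how the training pairs could be generated from a sliding window; the function name `training_pairs` and the `window` parameter are illustrative assumptions, not part of Word2Vec's or spaCy's API, and real implementations add steps such as subsampling and negative sampling.

```python
def training_pairs(tokens, window=2):
    """Build illustrative CBOW and Skip-Gram training pairs from a token list."""
    cbow, skipgram = [], []
    for i, target in enumerate(tokens):
        # Context = up to `window` words on each side of the target word.
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        context = left + right
        if not context:
            continue
        # CBOW: the whole context is the input, the target word is the label.
        cbow.append((context, target))
        # Skip-Gram: the target word is the input, each context word is a label.
        skipgram.extend((target, c) for c in context)
    return cbow, skipgram


sentence = "the quick brown fox jumped".split()
cbow_pairs, skipgram_pairs = training_pairs(sentence, window=1)
print(cbow_pairs[1])       # (['the', 'brown'], 'quick')
print(skipgram_pairs[:3])  # [('the', 'quick'), ('quick', 'the'), ('quick', 'brown')]
```

Note that CBOW yields one example per word, while Skip-Gram yields one example per (word, neighbor) pair.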

<img src='CBOW_Skip.png'>

**Recall** each word is now represented by a vector
> With spaCy's `en_core_web_lg` model, each vector has **300** dimensions

**Cosine Similarity** is a measure of similarity between vectors, based on the angle between them
> * Now that we have our words in vectors, we can evaluate their relationships

<img src='Cos_Sim.png'>
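
As a rough sketch of the calculation, cosine similarity is the dot product of two vectors divided by the product of their lengths, $\cos\theta = \frac{a \cdot b}{\|a\| \, \|b\|}$. The helper name `cosine_similarity` below is an assumption for illustration and uses plain Python lists rather than spaCy vectors.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))   # 1.0  (same direction)
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))   # 0.0  (orthogonal)
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))  # -1.0 (opposite direction)
```

Because only the angle matters, scaling a vector does not change its similarity to another vector.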

**Vector Arithmetic** can now be performed
> e.g. subtracting man from king and then adding woman gives a vector that lands close to queen

$$\text{new vector} = \text{king} - \text{man} + \text{woman} \approx \text{queen}$$
<img src='Vec_Sim_Example.png'>
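
To make the arithmetic concrete, here is a toy sketch. The 2-D vectors in `toy_vectors` are made up purely for illustration (roughly "royalty" and "gender" axes) and are not real embeddings; the helper names `cosine` and `analogy` are likewise assumptions. The input words are excluded from the candidates, which is a common convention when evaluating analogies.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


# Hypothetical 2-D "embeddings": (royalty, gender). Invented values, not model output.
toy_vectors = {
    "king":  [0.9, 0.8],
    "man":   [0.1, 0.9],
    "woman": [0.1, -0.9],
    "queen": [0.9, -0.8],
    "apple": [-0.8, 0.0],
}


def analogy(a, b, c, vectors):
    """Return the word whose vector is closest to vectors[a] - vectors[b] + vectors[c]."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    # Skip the three input words so the answer is a new word.
    candidates = {w: v for w, v in vectors.items() if w not in (a, b, c)}
    return max(candidates, key=lambda w: cosine(target, candidates[w]))


print(analogy("king", "man", "woman", toy_vectors))  # queen
```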

In [1]:
import spacy

In [2]:
nlp = spacy.load("en_core_web_lg")

In [6]:
# Show the first 10 components of the vector for the word 'lion', then the vector's shape.
# Doc and Span objects also have vectors, computed as the average of their individual token vectors.
# This gives you document-level vectors as well as word-level ones.
display(nlp(u"lion").vector[0:10])
display(nlp(u"lion").vector.shape)

array([ 0.18963 , -0.40309 ,  0.3535  , -0.47907 , -0.43311 ,  0.23857 ,
        0.26962 ,  0.064332,  0.30767 ,  1.3712  ], dtype=float32)

(300,)

In [8]:
# The Doc vector is the average of the vectors of all the individual tokens in the text
# (a simple averaged document vector, not the separate Doc2Vec algorithm)
display(nlp(u'The quick brown fox jumped').vector[0:10])
display(nlp(u'The quick brown fox jumped').vector.shape)

array([-0.209218  , -0.0278228 , -0.0357064 ,  0.1552184 , -0.012805  ,
        0.13162704, -0.19946599,  0.0475812 ,  0.1267988 ,  1.647928  ],
      dtype=float32)

(300,)
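
The averaging described above can be sketched in plain Python. The function name `mean_vector` and the 3-D example vectors are assumptions for illustration only; spaCy computes the document vector for you.

```python
def mean_vector(vectors):
    """Component-wise average of a list of equal-length vectors."""
    n = len(vectors)
    # zip(*vectors) groups the i-th component of every vector together.
    return [sum(component) / n for component in zip(*vectors)]


# Two made-up 3-D "token vectors" standing in for real 300-D ones.
token_vectors = [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]]
doc_vector = mean_vector(token_vectors)
print(doc_vector)  # [2.0, 3.0, 4.0]
```

The result has the same number of dimensions as each token vector, which is why the document vector above also has shape `(300,)`.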

In [9]:
tokens = nlp(u'lion cat pet')

In [10]:
# This numerical value is the cosine similarity
for token1 in tokens:
    for token2 in tokens:
        print(token1.text, token2.text, token1.similarity(token2))

lion lion 1.0
lion cat 0.5265437
lion pet 0.39923772
cat lion 0.5265437
cat cat 1.0
cat pet 0.7505456
pet lion 0.39923772
pet cat 0.7505456
pet pet 1.0


In [11]:
tokens = nlp(u'like love hate')

for t1 in tokens:
    for t2 in tokens:
        print(t1.text, t2.text, t1.similarity(t2))

like like 1.0
like love 0.65790397
like hate 0.6574652
love like 0.65790397
love love 1.0
love hate 0.6393099
hate like 0.6574652
hate love 0.6393099
hate hate 1.0


In [14]:
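# Size of the model's vector table: the number of vectors, then (number of vectors, dimensions)
# Related: the Euclidean (L2) norm of a vector is the square root of the sum of its squared components
# spaCy exposes this norm on each token as `token.vector_norm`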
display(len(nlp.vocab.vectors))
nlp.vocab.vectors.shape

684831

(684831, 300)
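
For reference, the Euclidean (L2) norm mentioned in the comment above can be sketched as follows. The helper names `l2_norm` and `normalize` are assumptions for illustration; in spaCy the equivalent value for a token is available as `token.vector_norm`.

```python
import math


def l2_norm(vector):
    """Euclidean length: square root of the sum of squared components."""
    return math.sqrt(sum(x * x for x in vector))


def normalize(vector):
    """Scale a non-zero vector to unit length."""
    norm = l2_norm(vector)
    return [x / norm for x in vector]


print(l2_norm([3.0, 4.0]))  # 5.0
unit = normalize([3.0, 4.0])
print(unit)
```

Dividing vectors by their L2 norm first means the cosine similarity reduces to a plain dot product.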

In [None]:
tokens = nlp(u'dog cat nargle')

