Now that we have word vectors, what can we do?
----

Math with words!

<center><img src="http://rlv.zcache.com/math_is_awesome_poster-re24bd4726be24b82acc1d83fe7b4a8e4_cru_8byvr_512.jpg" width="700"/></center>

Types of Word Math
----

1. Distance
2. Arithmetic
3. Clustering

1. Distance
----

<center><img src="http://blog.krecan.net/wp-content/family.png" width="600"/></center>

The relationships between words can be encoded as distances in the vector space.

Words that are related will be closer together than unrelated words.

Ways to measure distance
----
<center><img src="http://i1.wp.com/dataaspirant.com/wp-content/uploads/2015/04/euclidean.png?w=600" width="700"/></center>

<center><img src="http://i2.wp.com/dataaspirant.com/wp-content/uploads/2015/04/manhattan.png?w=600" width="700"/></center>

<center><img src="http://i2.wp.com/dataaspirant.com/wp-content/uploads/2015/04/cosine.png?resize=610%2C468" width="700"/></center>

[Read more here](http://dataaspirant.com/2015/04/11/five-most-popular-similarity-measures-implementation-in-python/)
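
The three measures pictured above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function names and the vectors `p` and `q` are made up for the example.

```python
import numpy as np

def euclidean_distance(a, b):
    """Straight-line (L2) distance between two vectors."""
    return float(np.linalg.norm(a - b))

def manhattan_distance(a, b):
    """Sum of absolute coordinate differences (L1 distance)."""
    return float(np.abs(a - b).sum())

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (a similarity, not a distance)."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical 2-d vectors, chosen only so the arithmetic is easy to check by hand.
p = np.array([1.0, 2.0])
q = np.array([4.0, 6.0])

print(euclidean_distance(p, q))  # 5.0
print(manhattan_distance(p, q))  # 7.0
print(cosine_similarity(p, q))
```

Note that the first two grow with the distance between points, while cosine similarity grows as the vectors become more alike.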

Check for understanding
-----

Which distance metric should we use for word vectors and why?

__Cosine similarity__ is most often used in NLP.

If all vectors are normalized to unit length, cosine similarity and Euclidean distance rank neighbors identically, because $\|a - b\|^2 = 2 - 2\cos(a, b)$.

In general, use cosine similarity, since it ignores vector length (for document vectors, this removes the effect of document length).

Cosine similarity is equivalent to the inner product of the normalized word vectors (i.e., the vectors scaled so that their length equals 1).

You can very loosely interpret the inner product of word vectors to be an approximation of the pointwise mutual information (PMI) between the two words they represent.

[Source](https://github.com/facebookresearch/fastText/issues/189)
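
The effect of normalization can be checked numerically. The sketch below uses made-up vectors (`a`, `b`, and a 10x longer copy `b_long`) and hypothetical helper names. It only illustrates the claims above and says nothing about how any particular library computes similarity.

```python
import numpy as np

def unit(v):
    """Scale a vector to length 1."""
    return v / np.linalg.norm(v)

def cos_sim(a, b):
    """Cosine similarity between two vectors."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Hypothetical vectors: b_long points the same way as b but is 10x longer.
a = np.array([3.0, 1.0, 2.0])
b = np.array([1.0, 2.0, 2.0])
b_long = 10 * b

# Cosine similarity ignores length; Euclidean distance does not.
same_cosine = np.isclose(cos_sim(a, b), cos_sim(a, b_long))
same_euclid = np.isclose(np.linalg.norm(a - b), np.linalg.norm(a - b_long))

# After normalizing, the plain inner product equals the cosine similarity,
dot_of_units = unit(a) @ unit(b)
# and the squared Euclidean distance is a fixed function of it: 2 - 2*cos.
sq_dist_of_units = np.sum((unit(a) - unit(b)) ** 2)

print(same_cosine, same_euclid)                             # True False
print(np.isclose(dot_of_units, cos_sim(a, b)))              # True
print(np.isclose(sq_dist_of_units, 2 - 2 * cos_sim(a, b)))  # True
```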

<center><img src="https://upload.wikimedia.org/math/f/3/6/f369863aa2814d6e283f859986a1574d.png" width="900"/></center>

Interpreting cosine value:

1 : word vectors point in exactly the same direction  
0 : word vectors are orthogonal (mathematically unrelated)  
-1 : word vectors point in exactly opposite directions  

__Student Activity__: Write a function that implements cosine similarity 

In [33]:
import numpy as np
from scipy.spatial.distance import cosine

def test_cos_sim():
    # Compare floats with np.isclose rather than ==, which is fragile under rounding
    v1 = np.array([1, 2, 3])
    assert np.isclose(cos_sim(v1, v1), 1)
    assert np.isclose(cos_sim(v1, v1), 1 - cosine(v1, v1))  # SciPy's cosine() is a distance: 1 - similarity

    v2 = np.array([-1, -2, -3])
    assert np.isclose(cos_sim(v1, v2), -1)
    assert np.isclose(cos_sim(v1, v2), 1 - cosine(v1, v2))

    v3 = np.array([0, 3])
    v4 = np.array([4, 0])
    assert np.isclose(cos_sim(v3, v4), 0)
    assert np.isclose(cos_sim(v3, v4), 1 - cosine(v3, v4))

    v5 = np.array([3, 45, 7, 2])
    v6 = np.array([2, 54, 13, 15])
    assert np.isclose(cos_sim(v5, v6), 0.97228425171235)
    assert np.isclose(cos_sim(v5, v6), 1 - cosine(v5, v6))

    return "tests pass 🙂"
    

In [35]:
def cos_sim(v1, v2):
    "Calculate cosine similarity between vector 1 and 2"
    return v1.dot(v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))

print(test_cos_sim())

tests pass 🙂


Words closest to “Sweden”
----

<center><img src="http://deeplearning4j.org/img/sweden_cosine_distance.png" width="700"/></center>

<center><img src="images/joke.png" width="700"/></center>
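
A nearest-neighbor lookup like the one pictured can be sketched as a ranking by cosine similarity. The 3-d vectors below are invented for illustration (real word2vec vectors are learned from a corpus and have hundreds of dimensions), and `most_similar` is a hypothetical helper.

```python
import numpy as np

# Hypothetical toy embeddings; the numbers are made up, not trained.
vectors = {
    "sweden":  np.array([0.9, 0.8, 0.1]),
    "norway":  np.array([0.8, 0.9, 0.1]),
    "denmark": np.array([0.9, 0.6, 0.2]),
    "banana":  np.array([0.1, 0.2, 0.9]),
}

def most_similar(word, vectors, topn=3):
    """Return the topn other words ranked by cosine similarity to `word`."""
    target = vectors[word] / np.linalg.norm(vectors[word])
    scores = {
        other: float(vec @ target / np.linalg.norm(vec))
        for other, vec in vectors.items()
        if other != word  # a word is trivially closest to itself
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:topn]

for neighbor, score in most_similar("sweden", vectors):
    print(f"{neighbor}: {score:.3f}")
```

With these toy numbers the two Scandinavian countries rank first and "banana" ranks last.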

Check for understanding
-----

What is a limitation of word vectors (aka grouping words by correlation)?

Antonyms often appear near each other (in both the data and the vector space), because they are used in similar contexts!

__Examples__:

- and / or
- good / bad

2. Arithmetic: Word analogies
---

The "Hello, world!" of word2vec:
> Man is to woman as king is to queen

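$\vec{king} - \vec{man} + \vec{woman} \approx \vec{queen}$

In practice, the answer is the word $w$ that maximizes $\cos(w, king) - \cos(w, man) + \cos(w, woman)$.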

<center><img src="http://multithreaded.stitchfix.com/assets/images/blog/vectors.gif" width="500"/></center>

[Let's play with an analogy demo](http://rare-technologies.com/word2vec-tutorial/#app)
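
The analogy can also be worked through by hand on toy data. In this sketch the 2-d vectors are invented so that the first coordinate loosely means "royalty" and the second "gender". The `analogy` helper is hypothetical. It finds the nearest cosine neighbor of the combined vector, which is one common way to answer analogy questions.

```python
import numpy as np

# Hypothetical toy embeddings: [royalty, gender]. Not trained on any corpus.
vectors = {
    "king":  np.array([0.9, 0.9]),
    "queen": np.array([0.9, -0.9]),
    "man":   np.array([0.1, 0.9]),
    "woman": np.array([0.1, -0.9]),
    "apple": np.array([-0.9, 0.1]),
}

def analogy(a, b, c, vectors):
    """Solve 'a is to b as c is to ?' via the nearest cosine neighbor of c - a + b."""
    query = vectors[c] - vectors[a] + vectors[b]
    query = query / np.linalg.norm(query)
    best_word, best_score = None, -np.inf
    for word, vec in vectors.items():
        if word in (a, b, c):  # exclude the question words themselves
            continue
        score = float(vec @ query / np.linalg.norm(vec))
        if score > best_score:
            best_word, best_score = word, score
    return best_word

print(analogy("man", "woman", "king", vectors))  # queen
```

Excluding the three question words matters: without that step, the nearest neighbor is often one of the inputs.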

Different paths (vectors) through the space encode different relationships.
----


Plurals (Models Suffixes)
-----
<center><img src="images/plurals.png" width="700"/></center>

Verb Tense (Models Inflections)
-----

<center><img src="images/verb.png" width="700"/></center>

Country-Capital (Models real-world relationships)
-----

<center><img src="images/country.png" width="700"/></center>

How can you use word2vec to build data products?
----

<center><img src="https://assets.toptal.io/uploads/blog/image/827/toptal-blog-image-1423052243609.jpg" width="400"/></center>

When I worked at an employment website, I built a recommendation engine for job seekers. 

The job seeker would have a resume and we would suggest jobs for them. 

My goal: given a current job title, suggest a "better" job. This would increase platform engagement.

What would be the next logical career move for a Babysitter?
------


________ is to __Babysitter__ as __Senior Engineer__ is to __Engineer__.

__Nanny__ is to __Babysitter__ as __Senior Engineer__ is to __Engineer__.

3. Clustering
----

<center><img src="http://colah.github.io/posts/2015-01-Visualizing-Representations/img/words-pic.png" width="700"/></center>

Use your favorite clustering algo! (K-means is a good start)

[Example 1](http://colah.github.io/posts/2015-01-Visualizing-Representations/)

[Example 2](http://douglasduhaime.com/blog/clustering-semantic-vectors-with-python)
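
As a rough sketch of the K-means suggestion above, here is a small NumPy version of Lloyd's algorithm run on invented 2-d word vectors. The `kmeans` function and the data are assumptions made for illustration. The deterministic "farthest-first" initialization is a simplification that keeps the example reproducible, where real implementations usually use randomized starts.

```python
import numpy as np

def kmeans(X, k, n_iter=10):
    """Lloyd's algorithm with a deterministic farthest-first initialization."""
    # Start from the first point, then repeatedly add the point that is
    # farthest from every centroid chosen so far.
    centroids = [X[0]]
    while len(centroids) < k:
        d = np.min([np.linalg.norm(X - c, axis=1) for c in centroids], axis=0)
        centroids.append(X[d.argmax()])
    centroids = np.array(centroids, dtype=float)

    for _ in range(n_iter):
        # Assign each point to its nearest centroid (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of its assigned points.
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return labels

# Hypothetical toy embeddings: three animals and three cities.
words = ["cat", "dog", "horse", "paris", "london", "berlin"]
X = np.array([
    [1.0, 0.1], [0.9, 0.2], [0.8, 0.1],
    [0.1, 1.0], [0.2, 0.9], [0.1, 0.8],
])

labels = kmeans(X, k=2)
for cluster_id in range(2):
    print(cluster_id, [w for w, l in zip(words, labels) if l == cluster_id])
```

Normalizing each row to unit length first makes Euclidean K-means behave more like clustering by cosine similarity.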

How could we unit test our word2vec model <br> (especially if it is built on a custom corpus)?
-----

Word2Vec is an unsupervised learning algorithm, so there are no labels to score its output against directly.

But there is a testing approach!

One possible method is to compare analogy performance against the [Google analogy test set](https://aclweb.org/aclwiki/Google_analogy_test_set_(State_of_the_art)).
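
An analogy-based test could look roughly like the sketch below. Everything in it is a stand-in: the toy vectors replace a trained model, the `questions` tuples replace rows from an analogy test file, and the helper names and the 0.9 threshold are assumptions you would tune for your own corpus.

```python
import numpy as np

# Hypothetical toy embeddings standing in for a trained model.
vectors = {
    "king":  np.array([0.9, 0.9]),
    "queen": np.array([0.9, -0.9]),
    "man":   np.array([0.1, 0.9]),
    "woman": np.array([0.1, -0.9]),
    "apple": np.array([-0.9, 0.1]),
}

# Each question reads: a is to b as c is to expected.
questions = [
    ("man", "woman", "king", "queen"),
    ("king", "queen", "man", "woman"),
    ("man", "king", "woman", "queen"),
    ("man", "woman", "uncle", "aunt"),  # out of vocabulary, so it is skipped
]

def predict(a, b, c, vectors):
    """Answer 'a is to b as c is to ?' with the nearest cosine neighbor of b - a + c."""
    words = [w for w in vectors if w not in (a, b, c)]
    matrix = np.array([vectors[w] / np.linalg.norm(vectors[w]) for w in words])
    query = vectors[b] - vectors[a] + vectors[c]
    # Scaling the query does not change the argmax, so it is left unnormalized.
    return words[int(np.argmax(matrix @ query))]

def analogy_accuracy(questions, vectors):
    """Fraction of in-vocabulary questions answered correctly."""
    answered = [q for q in questions if all(w in vectors for w in q)]
    correct = sum(predict(a, b, c, vectors) == expected
                  for a, b, c, expected in answered)
    return correct / len(answered)

def test_analogies():
    # The threshold is an assumption; pick one that suits your corpus.
    assert analogy_accuracy(questions, vectors) >= 0.9
    return "tests pass"

print(test_analogies())
```

For a custom corpus, a hand-written list of domain-specific analogies may be more informative than a general-purpose test set, since many of the general questions will be out of vocabulary.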

Summary
---
- After training, any vector operation can be applied to words.
- The most common operations are: 
    - Arithmetic (add and subtract)
    - Distance (typically using Cosine Similarity)
    - Clustering (typically K-Means)
- When possible, test machine learning systems.

<br>
<br>
<br>

----