# TD9 - Vectorizing Words the "Right Way"

Let's install some libraries. While they download, we will explain how we can embed words better than we did in the RNN TD.

\- Embed words >>> embed letters :

    "I would rather be able to see the last 50 words than the last 50 characters",

    "With letters, there is very little notion of distance between characters (is "r" closer to "d" than it is to "c"?)",

    "With letters, a network that has to output words will spend a lot of time learning how to predict letters so that the whole word makes sense. Let's give it the possibility to output words from a dictionary directly (if you predict dog vs cat, don't put 1000 neurons on the last layer)"

\- In images, if you change the value of an input pixel a little bit, the image doesn't change much.

\- With words, if you one-hot encode them and you go from (1, 0, 0, 0, 0, ...) to (0, 1, 0, 0, 0, ...), you jump from "a" to "able", a completely unrelated word (https://i.imgur.com/Z7aF0OS.jpeg).

\- In images, if you output an image very close to what you want (imagine a task of going from black & white to colours), the pixel values are going to be very similar to what you want and there is a way to tell the network "you failed but you're close" or "you failed and you're far from where I'd want you to be".

\- With words, if you one-hot encode them, you can only tell the network "you failed" or "yeah, that's the right word". If the network outputs "vehicle" instead of "car", there is no way of saying "almost!".
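To make the "no way of saying almost" point concrete, here is a minimal sketch in plain Python (no extra libraries needed). The four-word vocabulary and the 2-D "dense" vectors are made up for illustration; they are not taken from any trained model.

```python
import math

# Toy vocabulary (hypothetical, for illustration only).
vocab = ["a", "able", "car", "vehicle"]

def one_hot(word):
    """Return the one-hot vector of `word` over the toy vocabulary."""
    return [1.0 if w == word else 0.0 for w in vocab]

def euclidean(u, v):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# With one-hot vectors, every pair of distinct words is equally far apart:
# "car" is no closer to "vehicle" than it is to "a".
onehot_car_vehicle = euclidean(one_hot("car"), one_hot("vehicle"))
onehot_car_a = euclidean(one_hot("car"), one_hot("a"))

# Made-up dense vectors in which related words sit close together.
dense = {
    "car": [0.9, 0.1],
    "vehicle": [0.8, 0.2],
    "a": [-0.5, 0.7],
}
dense_car_vehicle = euclidean(dense["car"], dense["vehicle"])
dense_car_a = euclidean(dense["car"], dense["a"])

print(onehot_car_vehicle, onehot_car_a)
print(dense_car_vehicle, dense_car_a)
```

With the one-hot encoding both distances are identical, so a loss computed on them cannot reward "vehicle" over "a". With dense vectors, the distance itself carries the "you're close" signal.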

In [1]:
%pip install scikit-learn==0.21.3
%pip install wget==3.2
%pip install gensim==3.6.0
%pip install psutil==5.4.8
%pip install spacy==2.2.4

Collecting scikit-learn==0.21.3
  Downloading scikit-learn-0.21.3.tar.gz (12.2 MB)
Building wheels for collected packages: scikit-learn
  Building wheel for scikit-learn (setup.py) ... error
  ERROR: Command errored out with exit status 1:
   command: /usr/bin/python3 -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/lhotte_rom-46643/pip-install-k8sekqvf/scikit-learn/setup.py'"'"'; __file__='"'"'/tmp/lhotte_rom-46643/pip-install-k8sekqvf/scikit-learn/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' bdist_wheel -d /tmp/lhotte_rom-46643/pip-wheel-xutwp82o
       cwd: /tmp/lhotte_rom-46643/pip-install-k8sekqvf/scikit-learn/
  Complete output (28 lines):
  Partial import of sklearn during the build process.
  Traceback (most recent call last):
    File "<string>", line 

Let's download the model (trained on Google News, with a latent space of size 300).

In [4]:
import gensim.downloader as api

wv = api.load('word2vec-google-news-300')



What does an encoded word look like?

In [5]:
wv['computer']

array([ 1.07421875e-01, -2.01171875e-01,  1.23046875e-01,  2.11914062e-01,
       -9.13085938e-02,  2.16796875e-01, -1.31835938e-01,  8.30078125e-02,
        2.02148438e-01,  4.78515625e-02,  3.66210938e-02, -2.45361328e-02,
        2.39257812e-02, -1.60156250e-01, -2.61230469e-02,  9.71679688e-02,
       -6.34765625e-02,  1.84570312e-01,  1.70898438e-01, -1.63085938e-01,
       -1.09375000e-01,  1.49414062e-01, -4.65393066e-04,  9.61914062e-02,
        1.68945312e-01,  2.60925293e-03,  8.93554688e-02,  6.49414062e-02,
        3.56445312e-02, -6.93359375e-02, -1.46484375e-01, -1.21093750e-01,
       -2.27539062e-01,  2.45361328e-02, -1.24511719e-01, -3.18359375e-01,
       -2.20703125e-01,  1.30859375e-01,  3.66210938e-02, -3.63769531e-02,
       -1.13281250e-01,  1.95312500e-01,  9.76562500e-02,  1.26953125e-01,
        6.59179688e-02,  6.93359375e-02,  1.02539062e-02,  1.75781250e-01,
       -1.68945312e-01,  1.21307373e-03, -2.98828125e-01, -1.15234375e-01,
        5.66406250e-02, -

Instead of printing the whole array, printing its shape, maximum and minimum values may be more insightful.

In [7]:
wv['computer'].shape, wv['computer'].max(), wv['computer'].min()

((300,), 0.421875, -0.53515625)

Let's compare the distance between words that are close in spelling with the distance between words that are close in meaning.

In [13]:
wv.distance("Paris", "Party"), wv.distance("car", "cat"), wv.distance("car", "Lamborghini"), wv.distance("car", "vehicle")

(1.0615220293402672,
 0.7847181558609009,
 0.48789793252944946,
 0.21789032220840454)
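As far as we know, `wv.distance` in gensim returns the cosine distance, i.e. 1 minus the cosine similarity of the two word vectors. That would explain why the value can exceed 1, as it does for "Paris" and "Party": the cosine similarity is slightly negative there. The sketch below reimplements that formula in plain Python on made-up 2-D vectors; `cosine_distance` is our own helper, not a gensim function.

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity; ranges from 0 (same direction) to 2 (opposite)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

# Made-up vectors, chosen only to show the three typical cases.
same_direction = cosine_distance([1.0, 2.0], [2.0, 4.0])   # close to 0
orthogonal = cosine_distance([1.0, 0.0], [0.0, 1.0])       # unrelated directions
opposite = cosine_distance([1.0, 0.0], [-1.0, 0.0])        # opposite directions

print(same_direction, orthogonal, opposite)
```

Note that only the direction of the vectors matters, not their length: `[1, 2]` and `[2, 4]` are at distance (almost exactly) 0.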

In [14]:
wv.most_similar("Nike")

[('Adidas', 0.7950947284698486),
 ('adidas', 0.7668235898017883),
 ('NIKE', 0.7274937629699707),
 ('Nike_NYSE_NKE', 0.6847589015960693),
 ('Reebok', 0.663459300994873),
 ('Nike_NKE', 0.6379507184028625),
 ('Under_Armour', 0.5997496843338013),
 ('spokeswoman_Yoko_Mizukami', 0.5990645885467529),
 ('Adidas_ADDDY.PK_news', 0.5953589081764221),
 ('spokeswoman_Joani_Komlos', 0.5935247540473938)]

In [15]:
wv.most_similar("Paris")

[('Parisian', 0.6789354681968689),
 ('Hopital_Europeen_Georges_Pompidou', 0.6536554098129272),
 ('Spyker_D##_Peking', 0.6336591839790344),
 ('France', 0.633491039276123),
 ('Pantheon_Sorbonne', 0.6312517523765564),
 ('Aeroports_De', 0.621803879737854),
 ('Grigny_south', 0.6194689273834229),
 ('Place_Denfert_Rochereau', 0.6028153896331787),
 ('guest_Olivier_Dolige', 0.6024351119995117),
 ('Lazard_Freres_Banque', 0.5998712182044983)]

In [17]:
wv.most_similar_cosmul(positive=["king", "woman"], negative=["man"])

[('queen', 0.9314123392105103),
 ('monarch', 0.858533501625061),
 ('princess', 0.8476566076278687),
 ('Queen_Consort', 0.8150269985198975),
 ('queens', 0.8099815249443054),
 ('crown_prince', 0.808997631072998),
 ('royal_palace', 0.8027306795120239),
 ('monarchy', 0.801961362361908),
 ('prince', 0.800979733467102),
 ('empress', 0.7958388328552246)]

In [18]:
wv.most_similar_cosmul(positive=["boy", "woman"], negative=["man"])

[('girl', 1.018757939338684),
 ('mother', 0.9110710024833679),
 ('teenage_girl', 0.9046094417572021),
 ('toddler', 0.8983652591705322),
 ('daughter', 0.8961495757102966),
 ('child', 0.8923880457878113),
 ('teenager', 0.8660117387771606),
 ('niece', 0.8555038571357727),
 ('schoolgirl', 0.8544197678565979),
 ('baby', 0.8516771197319031)]
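The analogy queries above use `most_similar_cosmul`. To our understanding, it ranks candidates with a multiplicative objective (often called 3CosMul): cosine similarities are shifted into [0, 1], then the similarities to the positive words are multiplied together and divided by the similarities to the negative words. This would explain why a score can exceed 1, as it does for "girl". The sketch below is our own toy reimplementation under that assumption; the 2-D vectors, the `eps` value and all helper names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def cosmul_score(candidate, positive, negative, eps=1e-6):
    """Product of shifted similarities to positives over that to negatives."""
    num = 1.0
    for p in positive:
        num *= (1.0 + cosine(candidate, p)) / 2.0
    den = 1.0
    for n in negative:
        den *= (1.0 + cosine(candidate, n)) / 2.0
    return num / (den + eps)  # eps avoids division by zero

# Made-up vectors: axis 0 is roughly "royalty", axis 1 roughly "gender".
toy = {
    "king": [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man": [0.0, 1.0],
    "woman": [0.0, -1.0],
    "apple": [-1.0, 0.2],
}

def analogy(positive, negative):
    """Rank every word that is not part of the query, best score first."""
    query_words = set(positive) | set(negative)
    pos = [toy[w] for w in positive]
    neg = [toy[w] for w in negative]
    scores = {w: cosmul_score(vec, pos, neg)
              for w, vec in toy.items() if w not in query_words}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

ranking = analogy(positive=["king", "woman"], negative=["man"])
print(ranking)
```

On these toy vectors "queen" comes out first, with a score above 1, because it is close to "woman" and far from "man". The score is a ratio, not a similarity, so it is not bounded by 1.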

In [19]:
wv.most_similar_cosmul(positive=["cow", "oink"], negative=["pig"])

[('moos', 0.8037621378898621),
 ('baa', 0.7783335447311401),
 ('baaa', 0.777667224407196),
 ('oinks', 0.7757751941680908),
 ('mooing', 0.773527979850769),
 ('mooed', 0.7722421884536743),
 ('neighs', 0.7689445614814758),
 ('cows_mooing', 0.7671400904655457),
 ('whinnies', 0.7667444348335266),
 ('cock_doodle_doo', 0.7664759159088135)]

In [20]:
wv.most_similar_cosmul(positive=["cat", "oink"], negative=["pig"])

[('miaowing', 0.821600079536438),
 ('woofs', 0.8177747130393982),
 ('kitty_kitty', 0.8145589232444763),
 ('meowing', 0.8128674030303955),
 ('woof_woof', 0.8090406060218811),
 ('Yip_yip', 0.8033258318901062),
 ('meowed', 0.8021880984306335),
 ('cock_doodle_doo', 0.8019063472747803),
 ('chittering', 0.8003867864608765),
 ('purr', 0.7976775169372559)]

In [21]:
wv.most_similar_cosmul(positive=["dog", "oink"], negative=["pig"])

[('barks', 0.834601640701294),
 ('woofs', 0.8272683620452881),
 ('bark_incessantly', 0.817846953868866),
 ('barking_loudly', 0.8163576126098633),
 ('Tail_wagging', 0.8125094771385193),
 ('cats_meow', 0.810688853263855),
 ('dogs', 0.8093793988227844),
 ('barking', 0.8093082904815674),
 ('woof_woof', 0.8092902302742004),
 ('pooch', 0.8045751452445984)]

In [23]:
wv.most_similar_cosmul(positive=["Santa", "oink"], negative=["pig"])

[('HO_HO_HO', 0.9089415073394775),
 ('ho_ho_hoing', 0.9081220030784607),
 ('Ho_ho_ho', 0.8999334573745728),
 ('Santa_elves', 0.887882649898529),
 ('Mrs._Claus', 0.8841321468353271),
 ('Santa_Claus', 0.8761659264564514),
 ('Ho_Ho_Ho', 0.8700059056282043),
 ('Santa_Mrs._Claus', 0.8698497414588928),
 ('Ho_Ho_Hos', 0.8630191683769226),
 ('elves_reindeer', 0.8513056039810181)]