<a href="https://colab.research.google.com/github/MarinaWolters/Coding-Tracker/blob/master/Word_Embeddings.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Demo of Word Embeddings for CIS 5300


First, we're going to install a software package called Magnitude that allows for fast, efficient manipulation of word vectors.  If you'd like to learn more about it, you can read our [EMNLP 2018 paper about Magnitude](http://www.cis.upenn.edu/~ccb/publications/magnitude-fast-efficient-vector-embeddings-in-python.pdf), or you can read the [Magnitude developer documentation on GitHub](https://github.com/plasticityai/magnitude).<p>Then, we'll download a set of pre-trained word vectors that are stored in the Magnitude file format.  This file is several gigabytes in size, so it will take a few minutes to download. *(To execute the code, all you need to do is press the play button below.)*

In [None]:
# Install TensorFlow + Keras
!pip install -q tensorflow keras

# Install Magnitude on Google Colab
! echo "Installing Magnitude.... (please wait, can take a while)"
!pip install -r https://raw.githubusercontent.com/plasticityai/magnitude/master/requirements.txt
!(curl https://raw.githubusercontent.com/plasticityai/magnitude/master/install-colab.sh | /bin/bash 1>/dev/null 2>/dev/null)
!pip install spacy==3.1.2 1>/dev/null 2>/dev/null
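# Import twice: the first attempt is allowed to fail silently,
# and the second import is the one that must succeed.
try:
  from pymagnitude import *
except Exception:
  pass
from pymagnitude import *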
! echo "Done installing Magnitude."

# Download GloVe vectors
!curl -s http://magnitude.plasticity.ai/glove/medium/glove.6B.50d.magnitude --output vectors.magnitude
# Uncomment to use word2vec instead: !curl -s http://magnitude.plasticity.ai/word2vec+subword/GoogleNews-vectors-negative300.magnitude --output vectors.magnitude
# Uncomment to use fastText instead: !curl -s http://magnitude.plasticity.ai/fasttext+subword/wiki-news-300d-1M.magnitude --output vectors.magnitude

Installing Magnitude.... (please wait, can take a while)
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   137  100   137    0     0    678      0 --:--:-- --:--:-- --:--:--   678




Done installing Magnitude.


Download a word embeddings file.

In [None]:
#!wget http://magnitude.plasticity.ai/glove/heavy/glove.6B.300d.magnitude
!wget http://magnitude.plasticity.ai/word2vec/medium/GoogleNews-vectors-negative300.magnitude


--2023-03-21 15:24:11--  http://magnitude.plasticity.ai/word2vec/medium/GoogleNews-vectors-negative300.magnitude
Resolving magnitude.plasticity.ai (magnitude.plasticity.ai)... 54.231.203.133, 52.217.114.133, 52.216.41.157, ...
Connecting to magnitude.plasticity.ai (magnitude.plasticity.ai)|54.231.203.133|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 5295644672 (4.9G) [binary/octet-stream]
Saving to: ‘GoogleNews-vectors-negative300.magnitude.1’


2023-03-21 15:26:19 (39.6 MB/s) - ‘GoogleNews-vectors-negative300.magnitude.1’ saved [5295644672/5295644672]



After the files have downloaded, we can start running Python and Magnitude!  We will load a set of vectors from the file that we just downloaded.<p>Once the vectors are loaded, we can see how many vectors we've loaded with this command.  This means that we have vectors representing this many words.  This is the size of our **vocabulary.**  

In [None]:
from pymagnitude import *
#vectors = Magnitude("glove.6B.300d.magnitude")
vectors = Magnitude("GoogleNews-vectors-negative300.magnitude")

print("The number of words with vector representations in this file is %s." % len(vectors))

The number of words with vector representations in this file is 3000000.


We can see what the *dimensionality* of each vector is.  The dimensionality is just the length of the vector.  


In [None]:
vectors.dim

300

In [None]:
"cat" in vectors

True

In [None]:
vectors.most_similar("excellent")

[('terrific', 0.7409729),
 ('superb', 0.7062715),
 ('exceptional', 0.68147063),
 ('fantastic', 0.6802847),
 ('good', 0.644293),
 ('great', 0.6124601),
 ('Excellent', 0.6091997),
 ('impeccable', 0.5980968),
 ('exemplary', 0.5959651),
 ('marvelous', 0.58292854)]

We can print out what a vector looks like.  It should have a bunch of real-valued numbers (positive or negative).  The number of values that we will see is *vectors.dim*.

In [None]:
vectors.query("cat")

array([ 0.0040587,  0.0671903, -0.0938735,  0.0713696,  0.0388996,
        0.0273262,  0.0163957, -0.0031345,  0.0726555, -0.0414715,
        0.0265225, -0.1928908, -0.0014668, -0.0977313, -0.00432  ,
       -0.0274869,  0.0166368,  0.0498301, -0.1478829, -0.0044606,
        0.0707266, -0.0485442,  0.0739415, -0.04115  , -0.0319877,
        0.0819786, -0.0951595,  0.1202353,  0.1356665, -0.0282906,
       -0.0258795, -0.0649399, -0.0298981, -0.0466153, -0.0337559,
        0.0430789, -0.0011403,  0.0237899,  0.0145472,  0.1138056,
        0.0245936, -0.0369707,  0.0221824,  0.0369707,  0.0065101,
       -0.0406678,  0.0691192, -0.0237899, -0.0091623,  0.0182443,
       -0.1099478,  0.0281299,  0.1131626,  0.0459723,  0.016235 ,
       -0.0443649,  0.0536879, -0.1228071,  0.1305228,  0.0352026,
        0.072977 ,  0.0700837, -0.0295766,  0.0681547,  0.0294158,
       -0.0271655,  0.0196106,  0.0335951, -0.0633325, -0.0298981,
        0.1620283,  0.0130201, -0.0233076, -0.000658 , -0.0758

That's what a cat looks like according to our model!  It looks just like [this](https://en.wikipedia.org/wiki/Cat_intelligence#/media/File:An_up-close_picture_of_a_curious_male_domestic_shorthair_tabby_cat.jpg), right?  Well, not really, but the cool thing about vectors is that they allow us to say how similar two things are.  So we can ask, how similar are *professors* and *cucumbers*?  The result is a cosine similarity: a decimal between -1.0 and 1.0 (for word vectors it usually falls between 0 and 1.0), with numbers closer to 1 indicating that the words are more similar.

In [None]:
vectors.similarity("professor", "cucumber")


0.056618165

Wait, isn't that comparing apples and oranges?  No, but this is:

In [None]:
vectors.similarity("apples", "oranges")

0.69658417
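To make the similarity score concrete, here is a minimal sketch of cosine similarity written directly in NumPy. It assumes that the score returned above is a cosine similarity, which is the standard choice for word vectors. The helper `cosine_similarity` and the three-dimensional toy vectors are made up for illustration and are not part of Magnitude.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: dot(u, v) / (|u| * |v|)."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy 3-dimensional vectors (made up for illustration, not real embeddings).
apples = [1.0, 2.0, 0.0]
oranges = [2.0, 1.0, 0.0]
opposite = [-1.0, -2.0, 0.0]

print(cosine_similarity(apples, apples))    # same direction: highest possible score
print(cosine_similarity(apples, oranges))   # partially aligned: somewhere in between
print(cosine_similarity(apples, opposite))  # opposite direction: lowest possible score
```

Because the score depends only on the angle between the vectors, scaling a vector up or down does not change it.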

The Magnitude software allows you to query for the most similar word out of a list of words using the command *most_similar_to_given*, which takes a query word and then a list of other words to compare it to.

In [None]:
vectors.most_similar_to_given("kittens", ["oranges", "strawberries", "tomatoes", "cats", "dogs", "trees", ])

'cats'

We can also look for the word vectors that are most similar to a query word.  Here are the words that are most similar to *apples*.  Try replacing the word *apples* with whatever word you want, and re-running this cell (by pressing the play button again), and see what the most similar words are to the word that you entered.
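With the loaded vectors, this kind of lookup uses the same `vectors.most_similar(...)` call shown earlier for *excellent*, for example `vectors.most_similar("apples")`. The sketch below illustrates the idea behind such a lookup on a tiny, hypothetical vocabulary: score every other word by cosine similarity to the query and keep the top few. The `toy_vectors` table and this standalone `most_similar` function are assumptions made for illustration; they are not Magnitude's implementation.

```python
import numpy as np

# Hypothetical toy vocabulary with made-up 3-d vectors (not real embeddings).
toy_vectors = {
    "apples":  np.array([0.9, 0.1, 0.0]),
    "oranges": np.array([0.8, 0.2, 0.1]),
    "pears":   np.array([0.9, 0.2, 0.0]),
    "cats":    np.array([0.1, 0.9, 0.1]),
    "trees":   np.array([0.3, 0.2, 0.9]),
}

def most_similar(word, table, topn=3):
    """Rank every other word in `table` by cosine similarity to `word`."""
    query = table[word] / np.linalg.norm(table[word])
    scored = []
    for other, vec in table.items():
        if other == word:
            continue  # skip the query word itself
        scored.append((other, float(np.dot(query, vec / np.linalg.norm(vec)))))
    # Highest similarity first, truncated to the requested number of results.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]

neighbors = most_similar("apples", toy_vectors)
for neighbor, score in neighbors:
    print(f"{neighbor}: {score:.3f}")
```

A real vocabulary has millions of entries, so a library would typically use vectorized or approximate search rather than a Python loop like this one.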


# Polysemous words

In early word embeddings, polysemous words were a problem.  Polysemous words are words that have more than one meaning.  For example, "bug" can have lots of different meanings:
* A creepy-crawly thing
* Something that makes you ill
* An error in your code
* A covert listening device
* (Verb) be annoying  

In word2vec, all of these different meanings get averaged into one vector.

In [None]:
vectors.most_similar("bug", topn = 20)

[('bugs', 0.7603134),
 ('worm', 0.6060655),
 ('Bug', 0.569483),
 ('virus', 0.55493534),
 ('Y2K_millennium', 0.53581625),
 ('ActiveX_vulnerability', 0.53124774),
 ('http://bugs.gentoo.org', 0.525815),
 ('vuln', 0.5222951),
 ('insect', 0.51954466),
 ('Bagle_virus', 0.5150193),
 ('worms', 0.5076382),
 ('brown_recluse_spider', 0.5051397),
 ('Virut', 0.5046243),
 ('viruses', 0.50372773),
 ('flaw', 0.5024464),
 ('Brown_marmorated_stink', 0.50240886),
 ('SoBig_virus', 0.5019326),
 ('Koobface_virus', 0.50151485),
 ('spoofing_vulnerability', 0.50019515),
 ('tar_remover', 0.5000378)]

Polysemy is very common.  For instance, "apple" could be the fruit or the company.  Since the dominant sense of lowercase *apple* in our corpus is the fruit, all of its most similar vectors are fruits and other foods, not computers.

In [None]:
vectors.most_similar("apple", topn = 20)

[('apples', 0.7203598),
 ('pear', 0.6450697),
 ('fruit', 0.6410147),
 ('berry', 0.6302295),
 ('pears', 0.6133961),
 ('strawberry', 0.60582614),
 ('peach', 0.6025872),
 ('potato', 0.5960934),
 ('grape', 0.59358644),
 ('blueberry', 0.5866668),
 ('cherries', 0.5784383),
 ('mango', 0.57518566),
 ('apricot', 0.57277775),
 ('melon', 0.5719985),
 ('almond', 0.570483),
 ('Granny_Smiths', 0.56953335),
 ('grapes', 0.56922567),
 ('peaches', 0.5659247),
 ('pumpkin', 0.5651883),
 ('apricots', 0.56455696)]

However, we can average together vectors to get a different result.

In [None]:
apple = vectors.query("apple")
computer = vectors.query("computer")
averaged = (apple + computer)/2
vectors.most_similar(averaged, topn = 20)

[('computer', 0.5838944),
 ('apple', 0.58389425),
 ('computers', 0.44672984),
 ('laptop', 0.44189417),
 ('Apple_IIe', 0.40202093),
 ('apples', 0.39580956),
 ('iMac', 0.3923599),
 ('receive_MacMall_Exclusive', 0.3897292),
 ('laptop_computer', 0.38527435),
 ('cartoonish_apple', 0.38067073),
 ('iBook_laptop', 0.3763073),
 ('Macbook_laptop', 0.37304696),
 ('hackintosh', 0.3725756),
 ('mainframes_minicomputers', 0.37218946),
 ('Marco_Boglione', 0.3710637),
 ('iBook', 0.3694635),
 ('Surface_tabletop', 0.36521816),
 ('electric_typewriter', 0.36513108),
 ('logs_keystrokes', 0.36422026),
 ('view_HealthCast', 0.36404467)]

Here's what's happening under the hood.  I'm showing the first few numbers in the vectors, just to make it easier to see.

In [None]:
apple[:3]

array([-0.0205091, -0.0509621, -0.0038455], dtype=float32)

In [None]:
computer[:3]

array([ 0.0408137, -0.076433 ,  0.0467503], dtype=float32)

In [None]:
added = (apple+computer)
added[:3]

array([ 0.0203046 , -0.12739511,  0.0429048 ], dtype=float32)

In [None]:
averaged = (apple+computer)/2
averaged[:3]

array([ 0.0101523 , -0.06369755,  0.0214524 ], dtype=float32)

Note that capitalization matters: *apple* and *Apple* have different vectors.

In [None]:
vectors.most_similar("Apple", topn = 20)

[('Apple_AAPL', 0.74569863),
 ('Apple_Nasdaq_AAPL', 0.730041),
 ('Apple_NASDAQ_AAPL', 0.7175089),
 ('Apple_Computer', 0.71459734),
 ('iPhone', 0.6924266),
 ('Apple_NSDQ_AAPL', 0.68686044),
 ('Steve_Jobs', 0.6758423),
 ('iPad', 0.6580769),
 ('Apple_nasdaq_AAPL', 0.64449704),
 ('AAPL_PriceWatch_Alert', 0.6439754),
 ('Apple_iPad', 0.6227746),
 ('iPhones', 0.6192503),
 ('Nexus_One', 0.6192277),
 ('Appleâ_€_™', 0.6176695),
 ('Apple_AAPL_iPhone', 0.615996),
 ('Apple_AAPL_Fortune', 0.6144309),
 ('Apple_Inc_AAPL.O', 0.61417174),
 ('Mac_cloner_Psystar', 0.6066086),
 ('Apple_APPL', 0.60539895),
 ('iPod', 0.6045314)]

In [None]:
vectors.most_similar((vectors.query("delicious") + vectors.query("food")), topn = 20)

[('food', 1.3721681),
 ('delicious', 1.3721678),
 ('tasty', 1.2404301),
 ('scrumptious', 1.2201036),
 ('yummy', 1.1655107),
 ('Generous_portions', 1.1525309),
 ('savory_delights', 1.1455574),
 ('Rotisserie_chickens', 1.1427587),
 ('delicious_nutritious', 1.1423676),
 ('vegan_dishes', 1.1381708),
 ('crunchy_salty', 1.1376917),
 ('delectable', 1.1263512),
 ('tasty_snacks', 1.1258788),
 ('scrumptious_desserts', 1.1242514),
 ('foods', 1.1204889),
 ('gourmet', 1.1174433),
 ('sinful_desserts', 1.106785),
 ('nutritious', 1.1001396),
 ('Rotisserie_chicken', 1.0926499),
 ('delectable_dessert', 1.0915829)]
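Several scores in the output above are greater than 1, which a cosine similarity can never be. One plausible explanation, stated here as an assumption about what happens internally rather than something the source confirms, is that each stored word vector has unit length and the score is a dot product against the raw query vector. A sum of two unit vectors is longer than 1, so the dot products grow accordingly. The toy vectors and the `unit` helper below are made up for illustration.

```python
import numpy as np

def unit(v):
    """Scale a vector to length 1."""
    v = np.asarray(v, dtype=float)
    return v / np.linalg.norm(v)

# Made-up 3-d unit vectors standing in for two word embeddings.
delicious = unit([1.0, 0.2, 0.0])
food = unit([0.2, 1.0, 0.0])

summed = delicious + food  # not unit length any more

# Dot product against the raw sum can exceed 1 ...
raw_score = float(np.dot(summed, food))
# ... while renormalizing the query first gives a true cosine similarity.
normalized_score = float(np.dot(unit(summed), food))

print(raw_score, normalized_score)
```

Under this reading, the raw score for one of the two summed words equals 1 plus the cosine similarity between them, and dividing the sum by its length (or averaging, as in the *apple* and *computer* example) changes the scale of the scores but not the ranking.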

In subsequent models like ELMo and BERT, word embeddings were computed for sentences rather than individual words, allowing the surrounding words in the sentence to influence the vector for each word token.  These contextual embeddings give each occurrence of a word its own vector, so different senses of a word no longer have to share a single embedding.

# Solving word analogy problems

Magnitude also allows us to test the analogy-solving capabilities of word vectors.  For analogy problems like "***man*** is to ***king*** as ***woman*** is to **-----**" we are doing some vector arithmetic.   We take the vector for *king*, subtract the vector for *man*, and then add the vector for *woman*:<p>+ *king* <p>- *man*<p>+ *woman*<p>The result is a vector.  To figure out what word is closest to it, we find the most similar word vectors to the vector that resulted from our arithmetic.

In [None]:
vectors.most_similar(positive = ["king", "woman"], negative = ["man"])

[('queen', 0.71181935),
 ('monarch', 0.61896753),
 ('princess', 0.5902431),
 ('crown_prince', 0.5499462),
 ('prince', 0.5377322),
 ('kings', 0.5236845),
 ('Queen_Consort', 0.5235946),
 ('queens', 0.5181134),
 ('sultan', 0.5098594),
 ('monarchy', 0.50874126)]
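The arithmetic behind this query can be sketched with a handful of made-up two-dimensional vectors, where one axis loosely stands for "royalty" and the other for "gender". This is a simplified illustration under those assumptions: the `toy_vectors` table and the `solve_analogy` helper are hypothetical, and a library implementation may normalize or weight the vectors differently.

```python
import numpy as np

# Made-up 2-d vectors: axis 0 ~ "royalty", axis 1 ~ "gender".
toy_vectors = {
    "king":   np.array([1.0,  1.0]),
    "queen":  np.array([1.0, -1.0]),
    "man":    np.array([0.0,  1.0]),
    "woman":  np.array([0.0, -1.0]),
    "prince": np.array([0.7,  1.0]),
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def solve_analogy(a, b, c, table):
    """For 'a is to b as c is to ?', return the word closest to b - a + c.

    The three input words are excluded from the candidates, since they
    tend to be trivially close to the target vector.
    """
    target = table[b] - table[a] + table[c]
    candidates = [w for w in table if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, table[w]))

answer = solve_analogy("man", "king", "woman", toy_vectors)
print(answer)
```

In this toy space, *king* - *man* + *woman* lands exactly on the vector chosen for *queen*; with real embeddings the result is only approximately near the expected word, which is why the query returns a ranked list.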

You can try other gender-based analogy problems like: <p>***man*** is to ***actor*** as ***woman*** is to what?<p><p>***man*** is to ***congressman*** as ***woman*** is to what?<p>Try out other analogy problems on your own.  Ones related to countries often work.  For instance, <p>***london*** is to ***uk*** as ***paris*** is to what?<p>

In [None]:
vectors.most_similar(positive = ["congressman", "woman"], negative = ["man"])

[('congresswoman', 0.67109746),
 ('senator', 0.64611745),
 ('Congresswoman', 0.63603246),
 ('Congressman', 0.63283956),
 ('lawmaker', 0.6258571),
 ('congressmen', 0.59148425),
 ('Rep.', 0.5858079),
 ('congressional', 0.56658834),
 ('Congressional_District', 0.5613279),
 ('Congressmember', 0.5602188)]

# Bias in word vectors

Negative societal biases appear in word vectors, since they are trained on data that contain those biases.   

A classic example of bias in word analogy problems was demonstrated in [this paper](https://arxiv.org/abs/1607.06520).  

Outdated stereotypes of women are revealed in the word2vec embeddings when you use them to solve the analogy problem "***man*** is to ***computer_programmer*** as ***woman*** is to **-----**".


*(Be sure you are using these vectors, `vectors = Magnitude("GoogleNews-vectors-negative300.magnitude")`, to see this effect.  It does not appear in all word embeddings.)*


In [None]:
vectors.most_similar(positive = ["computer_programmer", "woman"], negative = ["man"])

[('homemaker', 0.5627119),
 ('housewife', 0.5105047),
 ('graphic_designer', 0.50518024),
 ('schoolteacher', 0.49794948),
 ('businesswoman', 0.4934892),
 ('paralegal', 0.49255118),
 ('registered_nurse', 0.49079752),
 ('saleswoman', 0.48816282),
 ('electrical_engineer', 0.47977278),
 ('mechanical_engineer', 0.47553998)]