You can download the .gz file containing the **pre-trained word embeddings** from the following source:

`!wget -nc https://lazyprogrammer.me/course_files/nlp/GoogleNews-vectors-negative300.bin.gz`

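The terminal command below decompresses the .gz file and only needs to be run once. The download is a plain gzip-compressed .bin file rather than a tar archive, so it only needs decompressing. Windows users can use the *7-Zip* tool instead, which may place the .bin file inside a folder of the same name. Wherever the .bin file ends up, make sure the path passed to `KeyedVectors.load_word2vec_format` further down points to it.

`gzip -d GoogleNews-vectors-negative300.bin.gz`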

The extracted **bin file** contains pre-trained word embeddings. Using embeddings trained by someone else is more common than you might think, since training a neural network takes a lot of resources and time. The downloaded model contains 300-dimensional vectors for 3 million words and phrases. The **Gensim** library is the go-to library for word embeddings in NLP, and its `KeyedVectors` class allows you to load the pre-trained vectors and query them.
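For intuition about what the bin file holds, here is a minimal sketch of the word2vec binary layout. It assumes a text header line with the vocabulary size and dimensionality, followed by one record per word (the word, a space, then little-endian 32-bit floats). The function name `read_word2vec_binary` and the toy data are made up for illustration. In practice `KeyedVectors.load_word2vec_format` does this work for you, and more carefully than this sketch.

```python
import io
import struct


def read_word2vec_binary(stream):
    """Read word2vec-style binary data from a binary stream into a dict.

    Hypothetical helper for illustration only, not Gensim's implementation.
    """
    # Header line: "<vocab_size> <dimensions>\n"
    vocab_size, dim = (int(x) for x in stream.readline().decode("utf-8").split())
    vectors = {}
    for _ in range(vocab_size):
        # Read the word one byte at a time, up to the separating space
        chars = bytearray()
        while True:
            ch = stream.read(1)
            if ch == b"":
                raise EOFError("unexpected end of stream while reading a word")
            if ch == b" ":
                break
            if ch != b"\n":  # skip the newline that may trail the previous record
                chars.extend(ch)
        # Assumption: each component is a little-endian 32-bit float
        raw = stream.read(4 * dim)
        vectors[chars.decode("utf-8")] = struct.unpack("<%df" % dim, raw)
    return vectors


# Build a tiny in-memory file with two made-up 3-dimensional vectors
buf = io.BytesIO()
buf.write(b"2 3\n")
buf.write(b"king " + struct.pack("<3f", 0.5, 0.25, -1.0) + b"\n")
buf.write(b"queen " + struct.pack("<3f", 0.5, 0.75, -1.0) + b"\n")
buf.seek(0)

toy = read_word2vec_binary(buf)
print(toy["king"])
```

The real file follows the same idea at a much larger scale: 3 million records of 300 floats each, which is why loading it takes a while and needs several gigabytes of memory.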

In [1]:
from gensim.models import KeyedVectors

In [3]:
# Load the pre-trained word vectors

word_vectors = KeyedVectors.load_word2vec_format(
    'data/GoogleNews-vectors-negative300.bin/GoogleNews-vectors-negative300.bin', 
    binary=True)

# Finding word analogies

In [4]:
def find_analogies(w1, w2, w3): 
    # w1 - w2 = ? - w3 
    # e.g. king - man = ? - woman 
    #      ? = +king +woman -man 
    
    r = word_vectors.most_similar(positive=[w1, w3], negative=[w2]) 
    print("%s - %s = %s - %s" % (w1, w2, r[0][0], w3))


In [5]:
find_analogies('king', 'man', 'woman')

king - man = queen - woman
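The arithmetic behind `find_analogies` can be sketched without Gensim. The snippet below is a toy version under stated assumptions: the 3-dimensional vectors are made up (roughly "royalty", "male", "female"), and `toy_most_similar` is a hypothetical stand-in that mimics the general approach of `most_similar`. It combines unit-normalized positive and negative vectors, ranks the remaining words by cosine similarity, and excludes the input words from the results.

```python
import math


def unit(v):
    """Scale a vector to length 1 so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def toy_most_similar(vectors, positive, negative, topn=3):
    """Rank words by cosine similarity to (sum of positives - sum of negatives)."""
    dim = len(next(iter(vectors.values())))
    target = [0.0] * dim
    signed = [(w, 1.0) for w in positive] + [(w, -1.0) for w in negative]
    for word, sign in signed:
        for i, x in enumerate(unit(vectors[word])):
            target[i] += sign * x
    target = unit(target)

    # Input words are excluded; otherwise they would usually rank highest
    inputs = set(positive) | set(negative)
    scored = [
        (word, sum(a * b for a, b in zip(target, unit(vec))))
        for word, vec in vectors.items()
        if word not in inputs
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


# Made-up toy embeddings, for illustration only
toy_vectors = {
    "king": [1.0, 1.0, 0.0],
    "queen": [1.0, 0.0, 1.0],
    "man": [0.0, 1.0, 0.0],
    "woman": [0.0, 0.0, 1.0],
    "child": [0.0, 0.5, 0.5],
    "throne": [1.0, 0.2, 0.2],
}

result = toy_most_similar(toy_vectors, positive=["king", "woman"], negative=["man"])
print(result[0][0])
```

Because the input words are removed before ranking, the top result is always some other word, which helps explain why a wrong answer can come back even when the "obvious" one is close by.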


In [6]:
find_analogies('france', 'paris', 'london')

france - paris = england - london


In [7]:
find_analogies('france', 'paris', 'rome')

france - paris = italy - rome


In [8]:
# Reversing country-city to city-country should work just as well, but it does not
# This indicates the word vectors do not capture every association perfectly

find_analogies('paris', 'france', 'italy')

paris - france = lohan - italy


In [9]:
find_analogies('france', 'french', 'english')

france - french = england - english


In [10]:
# Obviously the answer should be 'china' instead of 'tibet'
# Again, the most closely related word vector is not always the correct one

find_analogies('japan', 'japanese', 'chinese')

japan - japanese = tibet - chinese


In [11]:
find_analogies('japan', 'japanese', 'italian')

japan - japanese = italy - italian


In [12]:
# Expected 'may' as the previous month in logic, but at least we still got a month
# Note that GloVe has more success with this type of thing

find_analogies('december', 'november', 'june')

december - november = september - june


In [13]:
find_analogies('miami', 'florida', 'texas')

miami - florida = dallas - texas


In [14]:
# Expected answer like 'picasso' or 'rembrandt' - who is Jude???

find_analogies('einstein', 'scientist', 'painter')

einstein - scientist = jude - painter


In [15]:
find_analogies('man', 'woman', 'she')

man - woman = he - she


In [16]:
find_analogies('man', 'woman', 'aunt')

man - woman = uncle - aunt


In [17]:
find_analogies('man', 'woman', 'sister')

man - woman = brother - sister


In [18]:
# Oops! Should be 'husband', not 'son'! These vectors are not perfect; maybe 'husband' is just not the top result...

find_analogies('man', 'woman', 'wife')

man - woman = son - wife


In [19]:
find_analogies('man', 'woman', 'actress')

man - woman = actor - actress


In [20]:
find_analogies('man', 'woman', 'mother')

man - woman = father - mother


In [21]:
find_analogies('nephew', 'niece', 'aunt')

nephew - niece = uncle - aunt


# Finding similar words

This is a simpler function than the one used for word analogies: it just finds the words most similar to a given word.

In [24]:
# Given a word, what are the most similar words?

def nearest_neighbors(w): 
    r = word_vectors.most_similar(positive=[w]) 
    print("neighbors of: %s" % w) 
    
    # Loop through results to extract words only
    for word, score in r: 
        print("\t%s" % word)


In [26]:
nearest_neighbors('king')

neighbors of: king
	kings
	queen
	monarch
	crown_prince
	prince
	sultan
	ruler
	princes
	Prince_Paras
	throne


In [27]:
nearest_neighbors('france')

neighbors of: france
	spain
	french
	germany
	europe
	italy
	england
	european
	belgium
	usa
	serbia


In [28]:
# Where is 'tibet'???

nearest_neighbors('japan')

neighbors of: japan
	japanese
	tokyo
	america
	europe
	germany
	chinese
	india
	hawaii
	usa
	korea


In [29]:
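# Hmmm! Maybe too many people use 'einstein' sarcastically
# The vocabulary is also case-sensitive, so the capitalized 'Einstein' may behave quite differently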

nearest_neighbors('einstein')

neighbors of: einstein
	nikki
	lmfao
	albert
	armstrong
	joan
	becky
	mcmahon
	conrad
	lori
	haley


In [30]:
nearest_neighbors('woman')

neighbors of: woman
	man
	girl
	teenage_girl
	teenager
	lady
	teenaged_girl
	mother
	policewoman
	boy
	Woman


In [31]:
nearest_neighbors('nephew')

neighbors of: nephew
	son
	uncle
	brother
	grandson
	cousin
	father
	niece
	younger_brother
	nephews
	stepson


In [32]:
# Hmmm! All other months except for 'norway'...

nearest_neighbors('february')

neighbors of: february
	january
	april
	september
	december
	july
	october
	november
	june
	feb
	norway


**EXERCISE: Download pre-trained GloVe vectors glove.6B.zip from https://nlp.stanford.edu/projects/glove/.**

* **Implement your own `find_analogies()` and `nearest_neighbors()` custom functions**

**HINT:** you do NOT have to go hunting around on StackOverflow, you do NOT have to copy-and-paste code from anywhere external, and make sure to look at the file you downloaded.

**NOTE:** The GloVe pre-trained embeddings come as an 822 MB zip file with 4 different models (50, 100, 200 and 300-dimensional vectors) trained on Wikipedia 2014 and Gigaword 5 data with 6 billion tokens and a 400,000 word vocabulary. It is a large file to download, so make sure you have space on your computer. Once you have downloaded and unzipped the text files, you can run the word analogy and nearest neighbour functions on each vector space (50, 100, 200 and 300) and compare the results with Word2Vec.
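If you look at one of the unzipped text files, each line holds a word followed by its vector components, separated by spaces. As a minimal sketch of reading that layout by hand, assuming exactly this one-word-per-line format, the hypothetical helper `load_glove_text` below builds a plain dictionary. The sample lines are made up so the snippet runs without the download.

```python
import io


def load_glove_text(lines):
    """Parse GloVe-style text lines ("word v1 v2 ... vn") into a dict of lists."""
    embeddings = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        embeddings[parts[0]] = [float(x) for x in parts[1:]]
    return embeddings


# Made-up sample in the same layout, standing in for a real GloVe file
sample = io.StringIO("the 0.1 -0.2 0.3\ncat 0.5 0.25 -1.0\n")
toy_glove = load_glove_text(sample)
print(toy_glove["cat"])
```

With the real download, the same function could be fed an open file handle, for example `open('data/glove.6B.100d.txt', encoding='utf-8')`. From there, analogies and nearest neighbours only need cosine similarity over the dictionary values.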

In [None]:
# glove.6B.zip TOO LARGE TO DOWNLOAD! MAKE SPACE IN STORAGE


# <-- Copied-and-pasted from online -->

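# Convert txt file to word2vec format (adds the "vocab_size dimensions" header line)
# In Gensim 4.0+ this script is deprecated: KeyedVectors.load_word2vec_format(
# glove_input_file, binary=False, no_header=True) reads the GloVe text file directly
from gensim.scripts.glove2word2vec import glove2word2vec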

glove_input_file = 'data/glove.6B.100d.txt'
word2vec_output_file = 'data/glove.6B.100d.txt.word2vec'

# Convert GloVe model to Word2vec model
glove2word2vec(glove_input_file, word2vec_output_file)

In [None]:
# Load the Stanford GloVe model using same KeyedVectors class but ensure binary is False
model = KeyedVectors.load_word2vec_format('data/glove.6B.100d.txt.word2vec', binary=False)

# Calculate: (king - man) + woman = ? analogy
model.most_similar(positive=['woman', 'king'], negative=['man'], topn=1)

In [None]:
# Find analogies with converted GloVe model

def find_analogies(w1, w2, w3): 
    # w1 - w2 = ? - w3 
    # e.g. king - man = ? - woman 
    #      ? = +king +woman -man 
    
    r = model.most_similar(positive=[w1, w3], negative=[w2]) 
    print("%s - %s = %s - %s" % (w1, w2, r[0][0], w3))
