## Introduction

In this word embedding analysis, I use a pretrained word2vec model to find candidate words for a set of selected seed words.

### Download Google word2vec pretrained model

In [None]:
# Download the file from the internet
!wget -c "https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz"

--2022-03-18 20:36:10--  https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.216.100.237
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.216.100.237|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1647046227 (1.5G) [application/x-gzip]
Saving to: ‘GoogleNews-vectors-negative300.bin.gz’


2022-03-18 20:36:45 (45.2 MB/s) - ‘GoogleNews-vectors-negative300.bin.gz’ saved [1647046227/1647046227]



In [None]:
# uncompress the gzip file
!gzip -d GoogleNews-vectors-negative300.bin.gz

In [None]:
# import libraries
import pandas as pd
from gensim.models import KeyedVectors
import numpy as np
import json

In [None]:
import io
from google.colab import files

In [None]:
uploaded = files.upload()

Saving Seed words.xlsx to Seed words.xlsx


In [None]:
# load the seeds data from the excel
seeds_data_1 = pd.read_excel('Seed words.xlsx', sheet_name='Synoym')
seeds_data_2 = pd.read_excel('Seed words.xlsx', sheet_name='Ant')

In [None]:
seeds_data_1.sample(5)

|     | original seed word | explaination | expanded seed word |
|-----|--------------------|--------------|--------------------|
| 236 | liberty | freedom | sanction |
| 319 | right | sane, healthy | hale |
| 83 | free | unrestrained politically | unconstrained |
| 623 | limit | confine, restrict | cork |
| 468 | oppression | misery, hardship | cruelty |


In [None]:
seeds_data_2.sample(5)

|    | original seed words | expanded seed words |
|----|---------------------|---------------------|
| 82 | restrict | unfasten |
| 16 | liberty | work |
| 36 | allow | hold |
| 77 | restrict | open |
| 67 | restrict | free |


In [None]:
seeds_data_1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 632 entries, 0 to 631
Data columns (total 3 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   original seed word  632 non-null    object
 1   explaination        632 non-null    object
 2   expanded seed word  631 non-null    object
dtypes: object(3)
memory usage: 14.9+ KB


In [None]:
seeds_data_2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 117 entries, 0 to 116
Data columns (total 2 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   original seed words  117 non-null    object
 1   expanded seed words  117 non-null    object
dtypes: object(2)
memory usage: 2.0+ KB


The cells above show that there are 632 rows in the first sheet `Synoym` and 117 rows in the second one `Ant`, 749 rows in total.
Next, I extract the unique seed words from both sheets; these are later used to retrieve candidate words.

In [None]:
seeds = set(seeds_data_1["original seed word"].tolist() + seeds_data_2["original seed words"].tolist())

In [None]:
# the set built in the previous cell keeps each seed term once, even if it appears in both sheets; convert it back to a list
seeds = list(seeds)
# clean the list from nan
clean_seeds = [x for x in seeds if str(x) != 'nan']

In [None]:
print(f"we have exactly {len(clean_seeds)} seed terms in our list")

we have exactly 16 seed terms in our list
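
The failure messages printed further down show some seeds with a trailing space (`allow `) or a capital letter (`Interfere`), which suggests the spreadsheet cells are not normalized. Below is a minimal sketch of a cleanup step, assuming that stripping whitespace and lowercasing is acceptable for this seed list. `normalize_seeds` is a hypothetical helper, not part of the original notebook, and the input is toy data:

```python
def normalize_seeds(raw_seeds):
    """Strip whitespace, lowercase, drop missing values, and deduplicate (order kept)."""
    cleaned = []
    for seed in raw_seeds:
        # pandas represents empty cells as float NaN; skip those and any other non-string value
        if not isinstance(seed, str):
            continue
        seed = seed.strip().lower()
        if seed and seed not in cleaned:
            cleaned.append(seed)
    return cleaned

# toy input mimicking the kind of values seen in the notebook output
example = ["allow", "allow ", "Choice", "choice", float("nan"), "Interfere"]
print(normalize_seeds(example))
```

Note that the Google News vocabulary is case-sensitive, so lowercasing merges seeds such as `Choice` and `choice` that would otherwise return different neighbours. Whether that is desirable depends on the analysis.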


### Load the Google word embedding model

In [None]:
word_vectors = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True, limit=200000)


In [None]:
# test the loaded word vectors model
word_vectors['freedom'].shape

(300,)

In [None]:
# turn the (word, score) pairs returned by most_similar into a deduplicated list of
# lowercase words, with underscores replaced by spaces (the similarity order is not kept)
def get_similar_words(similar_words):
  return list(set([elm[0].lower().replace('_', ' ') for elm in similar_words]))
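
Because `set` is unordered, the list returned by this function loses the similarity ranking that `most_similar` provides. If the ranking matters, one alternative is to deduplicate with `dict.fromkeys`, which keeps the first occurrence of each word in order. A sketch, with `get_similar_words_ordered` as a hypothetical name and a hand-written list standing in for real model output:

```python
def get_similar_words_ordered(similar_words):
    """Normalize (word, score) pairs and deduplicate while keeping the ranking order."""
    normalized = (word.lower().replace('_', ' ') for word, _score in similar_words)
    # dicts preserve insertion order, so the first (most similar) occurrence wins
    return list(dict.fromkeys(normalized))

# made-up pairs in the (word, score) shape returned by most_similar
sample = [("Free_Choice", 0.71), ("choices", 0.69), ("free_choice", 0.65), ("Choices", 0.60)]
print(get_similar_words_ordered(sample))
```

As in the original function, lowercasing can collapse several neighbours into one, so fewer than `topn` words may come back.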

In [None]:
# build a dictionary where each key is a seed word and each value is the list of its most similar words
# initialize the dictionary for the Google News model

data_dictionary_GN = {}

# here we set how many words we want per seed term
n = 20

non_existent_seed = 0
for seed in clean_seeds : 
  try:
    data_dictionary_GN[seed] = get_similar_words(word_vectors.most_similar(seed, topn=n))
  except KeyError:  # raised by most_similar when the seed is not in the vocabulary
    non_existent_seed += 1
    print(f"the word {seed} doesn't exist in the word embedding space")

the word allow  doesn't exist in the word embedding space
the word Interfere doesn't exist in the word embedding space
the word oppression  doesn't exist in the word embedding space
the word restrict  doesn't exist in the word embedding space


In [None]:
print(f"there is {non_existent_seed} seed terms that doesn't exist in the word embedding space")

there is 4 seed terms that doesn't exist in the word embedding space
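
Instead of relying on the exception, membership can be tested up front, since gensim's `KeyedVectors` supports the `in` operator. The sketch below splits seeds into found and missing ones. `split_by_vocabulary` is a hypothetical helper, and a plain `set` stands in for the loaded vectors so the example runs without the model file:

```python
def split_by_vocabulary(seeds, vocabulary):
    """Return (found, missing) lists; `vocabulary` is anything that supports `in`."""
    found, missing = [], []
    for seed in seeds:
        (found if seed in vocabulary else missing).append(seed)
    return found, missing

# stand-in vocabulary; with the real model one would pass `word_vectors` instead
toy_vocabulary = {"allow", "choice", "free"}
found, missing = split_by_vocabulary(["allow", "allow ", "free", "Interfere"], toy_vocabulary)
print(found)
print(missing)
```

Printing the missing seeds with `repr()` makes stray whitespace visible, which is harder to spot in a plain f-string message.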


In [None]:
data_dictionary_GN

{'Choice': ['distinction',
  'direct',
  'favourite',
  'affordable',
  'free choice',
  'healthy',
  'choice',
  'choices',
  'ultimate',
  'best',
  'choose',
  'voice',
  'choice awards',
  'favorite',
  'connections',
  'advantage',
  'voucher',
  'choice award'],
 'allow': ['compel',
  'allow',
  'allowing',
  'enables',
  'enabled',
  'allowed',
  'enabling',
  'encourage',
  'enable',
  'let',
  'lets',
  'restrict',
  'give',
  'allows',
  'facilitate',
  'permitted',
  'requiring',
  'require'],
 'choice': ['option',
  'selection',
  'preferred',
  'preference',
  'chosen',
  'choosing',
  'brainer',
  'chose',
  'selected',
  'pick',
  'choices',
  'decision',
  'select',
  'choice',
  'options',
  'selections',
  'selecting',
  'choose'],
 'free': ['free',
  'unlimited',
  'freebie',
  'give aways',
  'giveaway',
  'unrestricted',
  'freebies',
  'unfettered',
  'discounted',
  'visit http://www.comtex.com',
  'complimentary',
  'open',
  'unhindered',
  'giveaways',
  'rest

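In the run shown above, this word2vec model could not find nearest neighbours for 4 of the 16 seed terms. Several of the failing seeds carry a trailing space or a capital letter, and `limit=200000` loads only the first 200,000 vectors from the file, so rarer words may also be absent. I try the same logic with another word embedding model below.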

In [None]:
# download the model and return as object ready for use
import gensim.downloader as api
model = api.load("glove-wiki-gigaword-100") 



In [None]:
# initialize a word dictionary
data_dictionary_glove = {}

# here we set how many words we want per seed term
n = 20

non_existent_seed = 0
for seed in clean_seeds : 
  try:
    data_dictionary_glove[seed] = get_similar_words(model.most_similar(seed, topn=n))
  except KeyError:  # raised by most_similar when the seed is not in the vocabulary
    non_existent_seed += 1
    print(f"the word {seed} doesn't exist in the word embedding space")

the word allow  doesn't exist in the word embedding space
the word Interfere doesn't exist in the word embedding space
the word oppression  doesn't exist in the word embedding space
the word restrict  doesn't exist in the word embedding space
the word Choice doesn't exist in the word embedding space


In [None]:
print(f"there is {non_existent_seed} seed terms that doesn't exist in the word embedding space")

there is 4 seed terms that doesn't exist in the word embedding space


The GloVe word embedding is meant to reach more seed terms than the Google word2vec model. Note that this GloVe vocabulary is lowercase, so a capitalized seed such as `Choice` is reported as missing.
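
One way to make the comparison between the two models concrete is to compute, for each result dictionary, the share of seeds that received neighbours. A sketch, assuming dictionaries shaped like `data_dictionary_GN` and `data_dictionary_glove`. `coverage` is a hypothetical helper and the inputs are invented toy data, not results from this notebook:

```python
def coverage(seeds, results):
    """Fraction of seeds that appear as keys in a results dictionary."""
    seeds = list(seeds)
    if not seeds:
        return 0.0
    return sum(1 for seed in seeds if seed in results) / len(seeds)

# toy data only; replace with clean_seeds and the two real dictionaries
toy_seeds = ["allow", "free", "choice", "Interfere"]
toy_results_a = {"allow": ["permit"], "free": ["unrestricted"], "choice": ["option"]}
toy_results_b = {"allow": ["permit"], "free": ["unrestricted"]}

print(f"model A coverage: {coverage(toy_seeds, toy_results_a):.2f}")
print(f"model B coverage: {coverage(toy_seeds, toy_results_b):.2f}")
```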

### Save Results
To save the results, I write them to a JSON file. This structure is more convenient because there is no need to repeat the seed word on every line: each seed term is a key, and its value is a list of up to 20 terms that are semantically closest to it.

In [None]:
# Save results for the 1st approach
with open('seeds_most_similar_with_GoogleNews.json', 'w') as fp:
    json.dump(data_dictionary_GN, fp, sort_keys=True, indent=4)

In [None]:
# Save results for the 2nd approach ## the better one
with open('seeds_most_similar_with_Glove.json', 'w') as fp:
    json.dump(data_dictionary_glove, fp, sort_keys=True, indent=4)
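
To confirm that a saved file has the expected structure, it can be read back with `json.load`. The sketch below uses a small made-up dictionary and a temporary directory, so it does not touch the real output files:

```python
import json
import os
import tempfile

# made-up stand-in for data_dictionary_GN / data_dictionary_glove
toy_results = {"free": ["unrestricted", "complimentary"], "choice": ["option"]}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_results.json")
    # write with the same options as the cells above
    with open(path, "w") as fp:
        json.dump(toy_results, fp, sort_keys=True, indent=4)
    # read it back and check that nothing was lost
    with open(path) as fp:
        reloaded = json.load(fp)

print(reloaded == toy_results)
print(sorted(reloaded))
```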