## Introduction
In this notebook, we run a word embedding analysis. With the help of a pretrained GloVe model, we explore which terms are most similar to the ones in our seed list.

In [1]:
# import needed libraries
import pandas as pd
from gensim.models import KeyedVectors
import numpy as np
import json

In [2]:
# load the seeds data from the excel
seeds_data_1 = pd.read_excel('Seed words.xlsx', sheet_name='Synoym')
seeds_data_2 = pd.read_excel('Seed words.xlsx', sheet_name='Ant')

In [3]:
seeds_data_1.sample(5)

|     | original seed word | explaination | expanded seed word |
| --- | --- | --- | --- |
| 278 | right | accurate, precise | proper |
| 96 | free | not busy; unoccupied | not tied down |
| 630 | limit | confine, restrict | ration |
| 426 | allow | permit an action | favor |
| 217 | liberty | freedom | emancipation |


In [4]:
seeds_data_2.sample(5)

|     | original seed words | expanded seed words |
| --- | --- | --- |
| 80 | restrict | stretch |
| 44 | allow | refuse |
| 13 | liberty | refusal |
| 77 | restrict | open |
| 78 | restrict | permit |


In [5]:
seeds_data_1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 632 entries, 0 to 631
Data columns (total 3 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   original seed word  632 non-null    object
 1   explaination        632 non-null    object
 2   expanded seed word  631 non-null    object
dtypes: object(3)
memory usage: 14.9+ KB


In [6]:
seeds_data_2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 117 entries, 0 to 116
Data columns (total 2 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   original seed words  117 non-null    object
 1   expanded seed words  117 non-null    object
dtypes: object(2)
memory usage: 2.0+ KB


The cells above show that the first sheet, `Synoym`, has 632 rows and the second one, `Ant`, has 117 rows. Across both sheets, that gives about 749 entries for each column of interest, the original seed word and the expanded seed word (the first sheet has one missing value in `expanded seed word`).
Let's combine the original seed words from both sheets into one list and start the NLP process of generating similar terms.

In [7]:
seeds = set(seeds_data_1["original seed word"].tolist() + seeds_data_2["original seed words"].tolist() )

In [8]:
# we used the set function in order to deduplicate the seed terms from the two lists and keep only unique ones
seeds = list(seeds)
# clean the list from nan
clean_seeds = [x for x in seeds if str(x) != 'nan']

In [9]:
print(f"we have exactly {len(clean_seeds)} seed terms in our list")

we have exactly 16 seed terms in our list


The idea now is to loop over the seed terms, retrieve the n most similar words for each one from the word embedding space, and store them in a dictionary where each key is a seed term and each value is the list of similar words found for it.

In [10]:
def get_similar_words(similar_words):
  return list(set([elm[0].lower().replace('_', ' ') for elm in similar_words]))
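
As a quick illustration of what this helper does: `most_similar` returns a list of `(word, score)` pairs, and the helper keeps only the words, lowercases them, replaces underscores with spaces, and removes duplicates. The neighbour list below is invented for the example and is not real model output.

```python
def get_similar_words(similar_words):
    # same logic as the helper defined in the notebook
    return list(set([elm[0].lower().replace('_', ' ') for elm in similar_words]))

# hypothetical (word, score) pairs, in the shape returned by most_similar
fake_neighbours = [("Freedom", 0.81), ("freedom", 0.80), ("civil_liberties", 0.77)]

result = get_similar_words(fake_neighbours)
print(sorted(result))  # ['civil liberties', 'freedom']
```

Because the words pass through a `set`, the similarity ranking is lost and the returned list can be shorter than `n` when two neighbours differ only by case.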

In [11]:
import gensim.downloader as api
model = api.load("glove-wiki-gigaword-100")  # download the model and return as object ready for use



In [12]:
# initialize a word dictionary
data_dictionary_glove = {}

# here we set how many words we want per seed term
n = 50

non_existent_seed = 0
for seed in clean_seeds:
  try:
    data_dictionary_glove[seed] = get_similar_words(model.most_similar(seed, topn=n))
  except KeyError:
    # gensim raises a KeyError when the seed is not in the model vocabulary
    non_existent_seed += 1
    print(f"the word {seed} doesn't exist in the word embedding space")

the word allow  doesn't exist in the word embedding space
the word Choice doesn't exist in the word embedding space
the word restrict  doesn't exist in the word embedding space
the word Interfere doesn't exist in the word embedding space
the word oppression  doesn't exist in the word embedding space
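
An alternative to catching the exception is to test vocabulary membership before the lookup. The sketch below assumes gensim 4.x, where the loaded model exposes its vocabulary as the `key_to_index` dictionary. The function name `split_by_vocabulary` and the toy vocabulary are made up for the illustration, so the example runs without downloading the model.

```python
def split_by_vocabulary(seeds, vocab):
    """Return (found, missing) lists, preserving the input order."""
    found = [s for s in seeds if s in vocab]
    missing = [s for s in seeds if s not in vocab]
    return found, missing

# toy stand-in for model.key_to_index (assumption: gensim 4.x KeyedVectors)
toy_vocab = {"liberty": 0, "allow": 1, "restrict": 2}
toy_seeds = ["liberty", "allow ", "Choice"]

found, missing = split_by_vocabulary(toy_seeds, toy_vocab)
print(found)    # ['liberty']
print(missing)  # ['allow ', 'Choice']
```

With the real model, the call would be `split_by_vocabulary(clean_seeds, model.key_to_index)`.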


In [13]:
print(f"there are {non_existent_seed} seed terms that don't exist in the word embedding space")

there are 5 seed terms that don't exist in the word embedding space


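With GloVe, 11 of the 16 seed terms are found in the embedding space, which is better coverage than we obtained with the Google word2vec model. The five misses all carry a trailing space or a capital letter, so they look like formatting issues in the spreadsheet rather than genuinely unknown words.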
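
A minimal sketch of how the seed list could be normalized before the lookup, so that variants such as `allow ` and `Choice` are stripped and lowercased to match the lowercase vocabulary of `glove-wiki-gigaword-100`. The function name `normalize_seeds` and the sample list are hypothetical, and whether every normalized seed is then found in the model has not been verified here.

```python
import math

def normalize_seeds(raw_seeds):
    """Strip whitespace, lowercase, drop NaN/empty values, and deduplicate."""
    cleaned = set()
    for seed in raw_seeds:
        # pandas represents empty Excel cells as float NaN
        if isinstance(seed, float) and math.isnan(seed):
            continue
        text = str(seed).strip().lower()
        if text:
            cleaned.add(text)
    return sorted(cleaned)

# hypothetical sample mimicking the issues seen in the output above
raw = ["allow ", "allow", "Choice", float("nan"), "restrict "]
normalized = normalize_seeds(raw)
print(normalized)  # ['allow', 'choice', 'restrict']
```

Normalizing can merge seeds that previously counted as distinct (here `allow ` and `allow`), so the number of unique seed terms may drop.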

## Save Results
To save the results, we write them to a JSON file. This structure is more convenient because there is no need to repeat the same seed word on every line: each seed term is a key, and its value is a list of up to 50 terms (`n = 50`) that are semantically closest to it.

In [14]:
# Save the results of the second (better) approach, GloVe
with open('seeds_most_similar_with_Glove.json', 'w') as fp:
    json.dump(data_dictionary_glove, fp, sort_keys=True, indent=4)
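
To show what this structure looks like when read back, here is a small round trip using the same `json.dump` arguments. The dictionary content and the file name `example_seeds.json` are invented for the illustration; the file is written to a temporary directory.

```python
import json
import os
import tempfile

# toy dictionary in the same shape as data_dictionary_glove
example = {"liberty": ["freedom", "rights"], "allow": ["permit"]}

path = os.path.join(tempfile.mkdtemp(), "example_seeds.json")
with open(path, "w") as fp:
    json.dump(example, fp, sort_keys=True, indent=4)

with open(path) as fp:
    loaded = json.load(fp)

print(list(loaded))       # ['allow', 'liberty'] (keys were sorted on write)
print(loaded["liberty"])  # ['freedom', 'rights']
```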

## Save results as CSV

In [22]:
global_d = []
for key, values in data_dictionary_glove.items():
  # key is the seed term and values is the list of nearest words to the seed term
  d = []
  for val in values:
    d.append( (key, val) )
  global_d.append(d)

In [27]:
flat_list = [item for sublist in global_d for item in sublist]

In [30]:
pd.DataFrame(flat_list, columns=['Original_seed_word', 'Expanded_seed_word']).to_csv("output.csv", index=False)
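
The nested loop and the flattening step above can also be written as a single comprehension. The sketch below uses a toy dictionary and the standard `csv` module, writing to an in-memory buffer instead of a file, so it runs on its own. The resulting `rows` list has the same shape as `flat_list` and could be passed to `pd.DataFrame` in the same way.

```python
import csv
import io

# toy dictionary in the same shape as data_dictionary_glove
toy_dictionary = {"liberty": ["freedom", "rights"], "allow": ["permit"]}

# one (seed, similar word) pair per row
rows = [(seed, word) for seed, words in toy_dictionary.items() for word in words]

buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["Original_seed_word", "Expanded_seed_word"])
writer.writerows(rows)
print(buffer.getvalue())
```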