# Word2vec model
This is the model from http://vectors.nlpl.eu/repository/#

Need to cite the following paper:  
Fares, Murhaf; Kutuzov, Andrei; Oepen, Stephan & Velldal, Erik (2017). Word vectors, reuse, and replicability: Towards a community repository of large-text resources, In Jörg Tiedemann (ed.), Proceedings of the 21st Nordic Conference on Computational Linguistics, NoDaLiDa, 22-24 May 2017. Linköping University Electronic Press. ISBN 978-91-7685-601-7

Detailed info about this model is in w2v_data/3/meta.json, but to summarize the main points:

- Lemmatization is True, and every lemma carries a part-of-speech tag (noun, verb, etc.), so vocabulary entries look like 'man_NOUN' or 'Aly_PROPN'.
- vocabulary size is 296630, and dimension is 300
- Gensim Continuous Skipgram, window size 5
- English Wikipedia Dump of February 2017
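
Because every vocabulary entry has the form `lemma_TAG`, it is often handy to separate the two parts. The following is a minimal sketch, assuming the tag is whatever follows the last underscore; `split_token` is a hypothetical helper, not part of gensim or of the model repository.

```python
def split_token(token):
    """Split 'lemma_TAG' into (lemma, tag) at the last underscore."""
    lemma, sep, tag = token.rpartition('_')
    if not sep:
        # No underscore at all: treat the whole token as an untagged lemma.
        return token, None
    return lemma, tag


for tok in ['man_NOUN', 'Aly_PROPN', 'Dominican::Republic_PROPN']:
    print(split_token(tok))
```

Splitting at the last underscore rather than the first keeps lemmas that themselves contain an underscore intact.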

And you might find the following commands very useful:
```
wiki_w2v.index2word
wiki_w2v.word_vec
wiki_w2v.similarity
wiki_w2v.similar_by_word
wiki_w2v.similar_by_vector
```
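
The attribute names above are those of gensim 3.x, which this notebook uses. In gensim 4.x the vocabulary list is exposed as `index_to_key` (replacing `index2word` and `index2entity`), and `get_vector` is the preferred name for `word_vec`. The sketch below shows one way to stay compatible with both; `get_vocab` is a hypothetical helper, and the `SimpleNamespace` objects are stand-ins, not real gensim models.

```python
from types import SimpleNamespace


def get_vocab(kv):
    """Return the list of vocabulary keys under either gensim naming scheme."""
    keys = getattr(kv, 'index_to_key', None)  # gensim >= 4.0 name
    if keys is None:
        keys = kv.index2word                  # gensim 3.x name used in this notebook
    return keys


# Stand-ins for the two API generations (not real gensim objects).
old_style = SimpleNamespace(index2word=['man_NOUN', 'Film_PROPN'])
new_style = SimpleNamespace(index_to_key=['man_NOUN', 'Film_PROPN'])

print(get_vocab(old_style) == get_vocab(new_style))
```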

----

In [6]:
import gensim.models as w2v       # Word2Vec Library
import pandas as pd
import sys
import numpy as np

In [5]:
%pwd

'/Users/mehrdadalvandipour/MyDir/Rule-Mining-on-Embedding-Enriched-KBs'

In [7]:
wiki_w2v = w2v.KeyedVectors.load_word2vec_format('./w2v_data/3/model.bin', binary=True)

In [5]:
type(wiki_w2v)

gensim.models.keyedvectors.Word2VecKeyedVectors

In [11]:
# Collect every part-of-speech tag that occurs in the vocabulary
voc = wiki_w2v.index2entity
types = set()

for x in voc:
    ind = x.rfind('_')  # the tag follows the last underscore
    types.add(x[ind+1:])

types

{'ADJ',
 'ADP',
 'ADV',
 'AUX',
 'CCONJ',
 'DET',
 'INTJ',
 'NOUN',
 'NUM',
 'PART',
 'PRON',
 'PROPN',
 'PUNCT',
 'SYM',
 'VERB',
 'X'}

In [10]:
def find_ex_type(t):
    # Assumes the globals 'voc' (list of words in the model) and 'types' (set of tags) exist.
    # Returns the vocabulary entries whose tag is 't' (case-insensitive).
    col = []
    t = t.upper()
    if t in types:
        for x in voc:
            if x.endswith('_' + t):  # match the tag only at the end of the entry
                col.append(x)
    else:
        print(t + " does not exist in types")
        print(types)

    return col

In [12]:
find_ex_type('ADV') # append ; to suppress the output

['also_ADV',
 'later_ADV',
 'however_ADV',
 'well_ADV',
 'still_ADV',
 'back_ADV',
 'often_ADV',
 'even_ADV',
 'together_ADV',
 'currently_ADV',
 'first_ADV',
 'originally_ADV',
 'never_ADV',
 'eventually_ADV',
 'usually_ADV',
 'thus_ADV',
 'away_ADV',
 'instead_ADV',
 'soon_ADV',
 'rather_ADV',
 'almost_ADV',
 'approximately_ADV',
 'sometimes_ADV',
 'much_ADV',
 'previously_ADV',
 'especially_ADV',
 'generally_ADV',
 'ever_ADV',
 'initially_ADV',
 'finally_ADV',
 'long_ADV',
 'already_ADV',
 'prior_ADV',
 'far_ADV',
 'subsequently_ADV',
 'particularly_ADV',
 'mostly_ADV',
 'always_ADV',
 'south_ADV',
 'therefore_ADV',
 'nearly_ADV',
 'mainly_ADV',
 'recently_ADV',
 'yet_ADV',
 'primarily_ADV',
 'directly_ADV',
 'highly_ADV',
 'immediately_ADV',
 'shortly_ADV',
 'quickly_ADV',
 'formerly_ADV',
 'typically_ADV',
 'commonly_ADV',
 'probably_ADV',
 'officially_ADV',
 'actually_ADV',
 'widely_ADV',
 'alone_ADV',
 'largely_ADV',
 'less_ADV',
 'respectively_ADV',
 'early_ADV',
 'simply_ADV',

In [283]:
#wiki_w2v.word_vec('Dominican::Republic_PROPN') 
#wiki_w2v.similarity(w1,w2)
'Film_PROPN' in voc
wiki_w2v.similar_by_word('Film_PROPN')

[('Documentary_PROPN', 0.7497795820236206),
 ('Festival_PROPN', 0.7208018898963928),
 ('FIPRESCI_PROPN', 0.7063645124435425),
 ('Audience_PROPN', 0.6939208507537842),
 ('Outfest_PROPN', 0.6913202404975891),
 ('Cinequest_PROPN', 0.6887915730476379),
 ('Seattle::International::Film::Festival_PROPN', 0.6837893724441528),
 ('Fantasporto_PROPN', 0.6767797470092773),
 ('Cinema_PROPN', 0.6761119365692139),
 ('Sundance_PROPN', 0.6749467849731445)]

Words in the gensim model have the form 'word_LEMM' (a lemma plus its part-of-speech tag), whereas FB15K identifies entities by '/m/id' strings, so to compute similarities we first need to convert each mid to 'word_LEMM' form.  
The file `mid2name.tsv` maps each mid to a 'word'. Most entities in FB15K appear to be proper nouns, so we convert every mid to 'word_PROPN'.

To this end, we simply append '\_PROPN' to every word in `mid2name.tsv` and replace blank spaces with '::'. The function `convert_to_wiki_form` applies both changes to a given input word.

Note that `mid2name.tsv` contains many duplicates. The data is loaded into a data frame and the duplicates are dropped, keeping only the first instance of each mid. A dictionary mapping mids to word_LEMMs is then created.

Using this dictionary we can later map every entity `ent_i` to a lemmatized word `ent_lemm_i` and compare it with another via `wiki_w2v.similarity(ent_lemm_i, ent_lemm_j)`. We also need a threshold against which to compare this similarity score.
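
For intuition, the score returned by `wiki_w2v.similarity` is the cosine similarity of the two word vectors. Below is a minimal pure-Python sketch of that computation and of the threshold comparison. The vectors and the cut-off value are made up for illustration, and `cosine_similarity` is a hypothetical helper rather than gensim's implementation.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy 3-dimensional vectors (made up; the real model has 300 dimensions).
vec_i = [1.0, 2.0, 0.0]
vec_j = [2.0, 4.0, 0.0]  # same direction as vec_i
vec_k = [0.0, 0.0, 5.0]  # orthogonal to vec_i

threshold = 0.5  # illustrative cut-off
for name, other in [('j', vec_j), ('k', vec_k)]:
    score = cosine_similarity(vec_i, other)
    print(name, round(score, 3), score > threshold)
```

Only the direction of the vectors matters, not their length, which is why `vec_j` scores as a near-perfect match for `vec_i`.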

In [92]:
df = pd.read_csv('./FB15K/mid2name.tsv', sep='\t', header=None, names=['mid','word']) 

In [240]:
df.loc[df['mid'] == '/m/027rn']  # It has many duplicates

| | mid | word |
|---|---|---|
| 3403 | /m/027rn | Dominican Republic |


In [94]:
# Drop duplicate mids, keeping only the first occurrence of each
df.drop_duplicates(subset="mid",
                   keep='first', inplace=True)


df.loc[df['mid'] == '/m/027rn']


| | mid | word |
|---|---|---|
| 3403 | /m/027rn | Dominican Republic |


In [None]:
def convert_to_wiki_form(ent: str):
    ent = ent.replace(' ', '::')
    ent = ent + '_PROPN'
    return ent

In [95]:
#file.head()
keys = list(df['mid'])
mapping = list(df['word'])
mapp_lemm = [convert_to_wiki_form(str(w)) for w in mapping] # words converted to 'word_PROPN' form

In [112]:
mapping_dict = dict(zip(keys,mapp_lemm))

In [113]:
mapping_dict['/m/027rn']

'Dominican::Republic_PROPN'

--------

## Read entity2id.txt and create the similarity matrix/data frame. 

First we read "/OpenKE/benchmarks/FB15K/entity2id.txt" into a data frame.

In [124]:

ents = pd.read_csv("./OpenKE/benchmarks/FB15K/entity2id.txt",
                   sep = '\t',header=None, names=['mid'], skiprows=[0],usecols=[0]) # first row is lineTot


In [292]:
mid = ents.iloc[3]['mid']

In [140]:
ent_lemm = mapping_dict[mid]

In [141]:
ent_lemm in voc

False

Apparently some words are not found in the w2v vocabulary, so we look into it:

In [267]:
voc_set = set(voc)  # set membership is much faster than scanning the list
in_w2v = 0
not_in_dict = 0
subset = []      # The subset of words from FB15K that are in w2v.
not_in_w2v = []  # And the ones that aren't.
for i in range(len(ents)):
    mid = ents.iloc[i]['mid']
    try:
        ent_lemm = mapping_dict[mid]
    except KeyError:  # These mids are not even in mid2name.tsv.
        not_in_dict += 1
        continue
    if ent_lemm in voc_set:
        in_w2v += 1
        subset.append((i, mid, ent_lemm))
    else:
        not_in_w2v.append((i, mid, ent_lemm))

In [152]:
not_in_dict   # This many are not found in mid2name.tsv

185

Not that many, so it's cool.

In [223]:
in_w2v        # This many were found in word2vec vocabulary

7088

That's less than half of FB15K. Not cool, but let's move on:

In [157]:
subset[:3] # We have collected them all in a list. (id, mid, word_lemm). id -> entity2id.txt

[(0, '/m/027rn', 'Dominican::Republic_PROPN'),
 (1, '/m/06cx9', 'Republic_PROPN'),
 (6, '/m/01sl1q', 'Michelle::Rodriguez_PROPN')]

In [268]:
not_in_w2v # a peek at the words that are not in w2v, at least not in this format.

[(2, '/m/017dcd', 'Mighty::Morphin::Power::Rangers_PROPN'),
 (3, '/m/06v8s0', 'Wendee::Lee_PROPN'),
 (4, '/m/07s9rl0', 'Drama::film_PROPN'),
 (5, '/m/0170z3', 'American::History::X_PROPN'),
 (8, '/m/0cnk2q', 'Australia::national::association::football::team_PROPN'),
 (10, '/m/02_j1w', 'Defender::(association::football)_PROPN'),
 (13, '/m/03h_f4', '34th::Canadian::Parliament_PROPN'),
 (14, '/m/011yn5', 'As::Good::as::It::Gets_PROPN'),
 (16, '/m/04nrcg', 'Maldives::national::football::team_PROPN'),
 (17, '/m/02sdk9v', 'Forward::(association::football)_PROPN'),
 (19, '/m/014lc_', 'Star::Trek::Nemesis_PROPN'),
 (20, '/m/05cvgl', 'The::Remains::of::the::Day::(film)_PROPN'),
 (21,
  '/m/04kxsb',
  'BAFTA::Award::for::Best::Actor::in::a::Leading::Role_PROPN'),
 (22, '/m/02qyp19', 'BAFTA::Award::for::Best::Original::Screenplay_PROPN'),
 (23, '/m/02d413', 'Philadelphia::(film)_PROPN'),
 (28, '/m/09w1n', 'Alpine::skiing_PROPN'),
 (29, '/m/0sx8l', '1980::Winter::Olympics_PROPN'),
 (32, '/m/0gqng'

In [295]:
# Here we compute the similarity b/w all pairs of words that were found in the w2v vocab.
# There are too many of these: to be exact, 7088 choose 2 = 25,116,328.
# So you may not want to collect them all, and may interrupt the loop instead. That's why a
# progress percentage is printed in the output, so you can estimate how long it will
# take, and why KeyboardInterrupt is handled explicitly.
#
# You might also want to apply the threshold here instead of filtering later.
# If so, just set threshold to a non-zero value < 1.
threshold = 0.5
h = [] # heads
t = [] # tails
s = [] # score
not_checked = subset.copy()
c=0
tot = len(subset)
try:
    for w1 in subset:
        c += 1
        not_checked.remove(w1)
        for w2 in not_checked:
            cos_val = wiki_w2v.similarity(w1[2],w2[2])
            if cos_val > threshold:
                h.append(w1[1])
                t.append(w2[1])
                s.append(cos_val) 
        print(f'{100*c/tot:.3f} %\r', end="") # Print progress %
except KeyboardInterrupt: # Handle it because the lengths of h, t, s may not match after an interrupt.
    print('KeyboardInterrupt at ' + f'{100*c/tot:.3f} %')
    if len(h) > len(t):
        print("I popped!")
        h.pop()
    if len(s) < len(h):
        print("I append!")
        s.append(wiki_w2v.similarity(w1[2],w2[2])) 
    if not(len(s) == len(h) == len(t)):
        print("len s,h,t still don't match :(")
    print("h: " + str(len(h)) + " t: " + str(len(t)) + " s: " + str(len(s)) )
    

KeyboardInterrupt at 1.086 %
h: 22777 t: 22777 s: 22777
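
One alternative to repairing three parallel lists after an interrupt is to store each hit as a single tuple, so the columns can never fall out of step. The sketch below assumes entries in the same `(id, mid, word_lemm)` layout as `subset`. The function `similar_pairs`, the toy mids, and the toy scores are all hypothetical, and `score_fn` stands in for `wiki_w2v.similarity`.

```python
import math
from itertools import combinations


def similar_pairs(entries, score_fn, threshold):
    """Collect (head_mid, tail_mid, score) for every unordered pair above threshold."""
    rows = []
    try:
        for (_, mid1, w1), (_, mid2, w2) in combinations(entries, 2):
            score = score_fn(w1, w2)
            if score > threshold:
                rows.append((mid1, mid2, score))  # a single append keeps columns aligned
    except KeyboardInterrupt:
        pass  # keep whatever was collected so far
    return rows


# Toy data in the (id, mid, word_lemm) layout; mids and scores are made up.
toy_subset = [(0, '/m/a', 'A_PROPN'), (1, '/m/b', 'B_PROPN'), (2, '/m/c', 'C_PROPN')]
toy_scores = {('A_PROPN', 'B_PROPN'): 0.9,
              ('A_PROPN', 'C_PROPN'): 0.2,
              ('B_PROPN', 'C_PROPN'): 0.6}

rows = similar_pairs(toy_subset, lambda a, b: toy_scores[(a, b)], threshold=0.5)
print(rows)
print(math.comb(7088, 2))  # number of unordered pairs among 7088 entities
```

A list of such tuples can be turned into a data frame with `pd.DataFrame(rows, columns=['head', 'tail', 'score'])`.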


In [296]:
d = {'head': h, 'tail':t, 'score':s}
sim_df = pd.DataFrame(data=d)
sim_df

| | head | tail | score |
|---|---|---|---|
| 0 | /m/027rn | /m/0160w | 0.538880 |
| 1 | /m/027rn | /m/0jgd | 0.619635 |
| 2 | /m/027rn | /m/0b90_r | 0.676415 |
| 3 | /m/027rn | /m/027nb | 0.646344 |
| 4 | /m/027rn | /m/0h3y | 0.501404 |
| ... | ... | ... | ... |
| 22772 | /m/01914 | /m/0d05w3 | 0.680367 |
| 22773 | /m/01914 | /m/06f32 | 0.532584 |
| 22774 | /m/01914 | /m/03h64 | 0.529266 |
| 22775 | /m/01914 | /m/049d1 | 0.505844 |


In [297]:
# Filter out more [as you wish] and at the end pass the head and tail column 
# inside another df to the append module.
# For example:
filt = sim_df.loc[sim_df['score'] > .85].copy()
filt.drop(columns='score',inplace=True)
filt # pass this out

| | head | tail |
|---|---|---|
| 1380 | /m/01cwm1 | /m/01cwq9 |
| 5186 | /m/07y_7 | /m/05r5c |
| 5217 | /m/07y_7 | /m/01xqw |
| 12329 | /m/08815 | /m/01w5m |
| 12331 | /m/08815 | /m/03ksy |
| 12365 | /m/08815 | /m/05zl0 |
| 12498 | /m/08815 | /m/01bm_ |
| 15622 | /m/0jgd | /m/07twz |
