# Test wiki embeddings
In which we test the effectiveness of embeddings trained on Wikipedia "words, entities and concepts", i.e. ConVec (described [here](https://github.com/ehsansherkat/ConVec) and in [Sherkat and Milios, 2017](https://arxiv.org/pdf/1702.03470.pdf)).

In [1]:
import pandas as pd

## Load data

In [97]:
wiki_embeddings = pd.read_csv('/hg190/corpora/wiki_embeddings/ConVec/WikipediaClean5Negative300Skip10.txt', sep=' ', 
                              index_col=0, header=None, skiprows=1)
# drop the row whose index was parsed as NaN (pandas reads tokens such as "nan" or "null" as missing values)
wiki_embeddings = wiki_embeddings[wiki_embeddings.index.notnull()]
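
The NaN index entry most likely comes from a vocabulary token that pandas treats as a missing value by default. The sketch below reproduces this on a made-up three-word file (`toy_file` and its contents are hypothetical, not taken from the ConVec data) and shows that `keep_default_na=False` keeps such tokens as literal strings.

```python
import io
import pandas as pd

# Hypothetical toy file in the same layout: a header line, then "token v1 v2"
toy_file = u"3 2\nobama 0.1 0.2\nnull 0.3 0.4\nwater 0.5 0.6\n"

# Default parsing: "null" is on pandas' default list of missing-value strings
default_load = pd.read_csv(io.StringIO(toy_file), sep=' ', index_col=0,
                           header=None, skiprows=1)
print(default_load.index.isnull().sum())

# keep_default_na=False disables that list, so every token stays a string
strict_load = pd.read_csv(io.StringIO(toy_file), sep=' ', index_col=0,
                          header=None, skiprows=1, keep_default_na=False)
print(list(strict_load.index))
```

Whether this is preferable to dropping the row depends on whether the affected token matters downstream; for a handful of test words, dropping it is harmless.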


## Test neighbors

Do we find that entities have reasonable nearest neighbors? E.g., "Obama" could be similar to "Clinton" or "Bush" (depending on who's writing the articles).

In [20]:
from scipy.spatial.distance import cdist
def get_neighbors(word, embeddings, top_k=10, dist='cosine'):
    # compute the distance (not similarity) from the query vector to every row
    w_vec = embeddings.loc[word, :].values.reshape(1,-1)
    word_dists = pd.Series(cdist(embeddings, w_vec, dist).reshape(-1), index=embeddings.index)
    # drop the query word itself, then sort so the smallest distance comes first
    word_dists = word_dists.drop(word).sort_values(ascending=True)
    # keep the top_k closest neighbors
    return word_dists.iloc[:top_k]
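
To make the ranking concrete, here is a minimal sketch of the same cosine-distance lookup on made-up 2-d vectors (`toy_embeddings` and its values are hypothetical). Note that `cdist(..., 'cosine')` returns a distance, i.e. one minus the cosine similarity, so smaller values mean closer neighbors.

```python
import pandas as pd
from scipy.spatial.distance import cdist

# Hypothetical 2-d embeddings, chosen so the expected ranking is obvious
toy_embeddings = pd.DataFrame(
    [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]],
    index=['obama', 'biden', 'tomato', 'opposite'])

# Query vector as a (1, 2) array, as cdist expects two 2-d inputs
query = toy_embeddings.loc[['obama']].values

# Cosine distance from the query to every row
toy_dists = pd.Series(cdist(toy_embeddings.values, query, 'cosine').reshape(-1),
                      index=toy_embeddings.index)

# Remove the query itself and rank the rest from closest to farthest
toy_neighbors = toy_dists.drop('obama').sort_values()
print(toy_neighbors)
```

The vector pointing the same way ranks first, the orthogonal one sits at distance 1, and the opposite one at distance 2.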

In [24]:
test_words = ['obama', 'tomato', 'water', 'atlanta', 'georgia', 'usa']
top_k = 10
for w in test_words:
    neighbors = get_neighbors(w, wiki_embeddings, top_k=top_k)
    print('top %d neighbors of %s: %s'%(top_k, w, ','.join(neighbors.index)))

top 10 neighbors of obama: 534366,barack,obamas,3414021,3356,biden,20082093,145422,5043192,5122699
top 10 neighbors of tomato: potato,eggplant,onions,9940234,carrots,garlic,spinach,lettuce,potatoes,onion
top 10 neighbors of water: seawater,potable,groundwater,seepage,198725,wastewater,262927,sewage,waters,desalinated
top 10 neighbors of atlanta: 3138,memphis,nashville,peachtree,dallas,miami,houston,knoxville,jacksonville,louisville
top 10 neighbors of georgia: tennessee,alabama,georgias,florida,carolina,virginia,kentucky,arkansas,8733443,atlanta
top 10 neighbors of usa: america,cinematexas,iscp,canada,united,germany,41194889,goteberg,cologneoff,miniprint


Most of these make sense: `atlanta` has other large Southern cities as neighbors.

There are a lot of numbers in here! They are presumably IDs for specific entities.

In [44]:
import codecs
id_title_file = '/hg190/corpora/wiki_embeddings/ConVec/id_title_map.csv'
# the first comma-separated field has the form "<prefix>:<id>": keep the integer ID after the colon
# the remaining fields are the title, rejoined because titles can contain commas
id_title_lookup = {}
with codecs.open(id_title_file, 'r', encoding='utf-8') as id_title_input:
    for l in id_title_input:
        fields = l.strip().split(',')
        if len(fields) >= 2:
            id_title_lookup[int(fields[0].split(':')[1])] = ','.join(fields[1:])
id_title_lookup = pd.Series(id_title_lookup)
print('%d total entities'%(len(id_title_lookup)))
print(id_title_lookup.head())

16527332 total entities
10     AccessibleComputing
12               Anarchism
13      AfghanistanHistory
14    AfghanistanGeography
15       AfghanistanPeople
dtype: object


Let's replace all the numeric IDs in the embedding index with the corresponding entity titles.

In [108]:
updated_idx = []
entity_suffix = '_ENTITY'
for i in wiki_embeddings.index:
    # default: keep the token unchanged
    new_idx = i
    if(i.isdigit()):
        try:
            i_int = int(i)
            if(i_int in id_title_lookup.index):
                # add suffix to indicate entity
                new_idx = id_title_lookup.loc[i_int] + entity_suffix
        # exception because pandas can't handle very long integer indices
        except Exception:
            pass
    updated_idx.append(new_idx)
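
The relabeling rule can be sketched on toy data with a plain dict, which looks up arbitrarily long integers without raising, so no exception handling is needed. The names `relabel`, `toy_lookup` and `toy_tokens` are hypothetical, and the ID-to-title pairs are illustrative stand-ins for the real lookup.

```python
# Illustrative stand-in for the ID -> title lookup
toy_lookup = {534366: 'Barack Obama', 3138: 'Atlanta'}

# Tokens as they might appear in the embedding index: words, known IDs,
# a digit string too long for a 64-bit integer, and an unknown ID
toy_tokens = ['obama', '534366', '3138', '99999999999999999999999', '42']

def relabel(token, lookup, suffix='_ENTITY'):
    """Return the entity title plus suffix for known numeric IDs, else the token."""
    if token.isdigit():
        title = lookup.get(int(token))
        if title is not None:
            return title + suffix
    return token

toy_relabeled = [relabel(t, toy_lookup) for t in toy_tokens]
print(toy_relabeled)
```

On the real data, a dict built once with `id_title_lookup.to_dict()` could play the role of `toy_lookup`, assuming the full mapping fits comfortably in memory.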

And do the same test as before.

In [110]:
wiki_embeddings_full = wiki_embeddings.copy()
wiki_embeddings_full.index = updated_idx
print(wiki_embeddings_full.index.sort_values()[:10])

Index([u'! (The Dismemberment Plan album)_ENTITY', u'!!!_ENTITY',
       u'!Action Pact!_ENTITY', u'!Hero (album)_ENTITY', u'!Hero_ENTITY',
       u'!Kung language_ENTITY', u'!Oka Tokat_ENTITY', u'!T.O.O.H.!_ENTITY',
       u'!WOWOW!_ENTITY', u'!Women Art Revolution_ENTITY'],
      dtype='object')


In [112]:
test_words = ['usa', 'obama', 'tomato', 'water', 'Atlanta_ENTITY', 'georgia']
top_k = 10
for w in test_words:
    neighbors = get_neighbors(w, wiki_embeddings_full, top_k=top_k)
    print('top %d neighbors of %s: %s'%(top_k, w, ';'.join(neighbors.index)))

top 10 neighbors of usa: america;cinematexas;iscp;canada;united;germany;Art Palm Beach_ENTITY;goteberg;cologneoff;miniprint
top 10 neighbors of obama: Barack Obama_ENTITY;barack;obamas;George W. Bush_ENTITY;Bill Clinton_ENTITY;biden;Presidency of Barack Obama_ENTITY;Joe Biden_ENTITY;Hillary Clinton_ENTITY;John Kerry_ENTITY
top 10 neighbors of tomato: potato;eggplant;onions;Tomato_ENTITY;carrots;garlic;spinach;lettuce;potatoes;onion
top 10 neighbors of water: seawater;potable;groundwater;seepage;Drinking water_ENTITY;wastewater;Groundwater_ENTITY;sewage;waters;desalinated
top 10 neighbors of Atlanta_ENTITY: Decatur, Georgia_ENTITY;atlanta;Marietta, Georgia_ENTITY;Savannah, Georgia_ENTITY;Macon, Georgia_ENTITY;Nashville, Tennessee_ENTITY;Memphis, Tennessee_ENTITY;Charlotte, North Carolina_ENTITY;Birmingham, Alabama_ENTITY;Georgia (U.S. state)_ENTITY
top 10 neighbors of georgia: tennessee;alabama;georgias;florida;carolina;virginia;kentucky;arkansas;Herty Field_ENTITY;atlanta


Better, although we still see some overlap between a lowercased word and the entity with the same name (`atlanta` vs. `Atlanta_ENTITY`).

Let's save these updated embeddings for later use.

In [125]:
import gzip
wiki_full_file_name = '/hg190/corpora/wiki_embeddings/ConVec/WikipediaClean5Negative300Skip10_withentities.gz'
# write one tab-separated line per token: the token followed by its vector values
with gzip.open(wiki_full_file_name, 'wb') as wiki_full_file:
    for i, r in wiki_embeddings_full.iterrows():
        r_combined = [i] + [str(v) for v in r.values.tolist()]
        r_combined_str = '\t'.join(r_combined) + '\n'
        wiki_full_file.write(r_combined_str.encode('utf-8'))
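
For later use, the saved file has to be read back with settings that match how it was written. The following is a minimal round-trip sketch on a made-up two-row frame written to a temporary path (`toy_full` and `toy_path` are hypothetical). It assumes titles contain no tab characters, and it turns off quote handling and default NaN parsing so that titles with quotes and tokens like `null` survive unchanged.

```python
import csv
import gzip
import os
import tempfile

import numpy as np
import pandas as pd

# Hypothetical two-row frame with a NaN-like token and a title containing a comma
toy_full = pd.DataFrame([[0.1, 0.2], [0.3, 0.4]],
                        index=['null', 'Decatur, Georgia_ENTITY'])

# Write it in the same layout: token, then tab-separated vector values
toy_path = os.path.join(tempfile.mkdtemp(), 'toy_withentities.gz')
with gzip.open(toy_path, 'wb') as toy_out:
    for token, row in toy_full.iterrows():
        line = '\t'.join([token] + [str(v) for v in row.values.tolist()]) + '\n'
        toy_out.write(line.encode('utf-8'))

# Read it back: no header row, first column as index, no quoting, no NaN tokens
toy_reloaded = pd.read_csv(toy_path, sep='\t', index_col=0, header=None,
                           compression='gzip', encoding='utf-8',
                           quoting=csv.QUOTE_NONE, keep_default_na=False)
print(toy_reloaded)
```

Tab is used as the separator because entity titles contain spaces and commas, so the space-delimited layout of the original embedding file would no longer parse.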