### Word Vectors with word2vec

Before running this worksheet, set `VECTORS_DIRECTORY` to the directory that holds your binary-format Google word vectors. The vectors can be downloaded from https://code.google.com/p/word2vec/.

You may also need to `pip install gensim` on your machine if you don't already have the `gensim` package.

In [1]:
from gensim.models.word2vec import Word2Vec
import random
VECTORS_DIRECTORY = '/Users/hopkinsm/Projects/thirdparty/word2vec/trunk/'

Next, load your word2vec model into memory. You can choose either the GoogleNewsModel or the FreebaseModel (both take a few minutes to load, but the GoogleNewsModel is faster); only the GoogleNewsModel is defined below.

In [2]:
from gensim.models.keyedvectors import KeyedVectors

class GoogleNewsModel:
    def __init__(self, vectors_dir):
        self.model = KeyedVectors.load_word2vec_format(vectors_dir + "/GoogleNews-vectors-negative300.bin", binary=True)
 
    def format_word(self, word):
        return '_'.join(word.strip().split())

model = GoogleNewsModel(VECTORS_DIRECTORY)


Now we define some functions for dealing with word vectors.

In [3]:
def get_word_vectors(model, words, out_file):
    # Write one line per word found: the word, then its vector components, tab-separated.
    word_not_found_counter = 0
    with open(out_file, 'w') as out_handle:
        for word in words:
            formatted_word = '_'.join(word.strip().split())
            if formatted_word in model.model:
                out_handle.write(word)
                out_handle.write('\t')
                out_handle.write('\t'.join([str(num) for num in model.model[formatted_word]]))
                out_handle.write('\n')
            else:
                # Python 3 strings are already Unicode, so the word can be printed directly.
                print("WARNING: no vector found for " + word)
                word_not_found_counter += 1
    print(str(word_not_found_counter) + " out of " + str(len(words)) + " words not found.")

def get_word_vector(model, word):
    # Return the vector for `word`, or None if no spelling variant is in the vocabulary.
    formatted_word = format_word(model, word)
    if formatted_word in model.model:
        return model.model[formatted_word]
    else:
        return None

def format_word(model, word):
    formatted_word = model.format_word(word)
    if formatted_word in model.model:
        return formatted_word
    elif formatted_word.lower() in model.model:
        return formatted_word.lower()
    else:
        alt_format = ''.join(word.lower().split())
        if alt_format in model.model:
            return alt_format
        else:
            return formatted_word
        
def compute_similarity(model, word1, word2):
    return model.model.similarity(format_word(model, word1), format_word(model, word2))
   
def analogy(A, isto, as_):
    # Answers "A is to isto as as_ is to ?"; uses the global `model` loaded above.
    result = model.model.most_similar_cosmul(positive=[isto, as_], negative=[A])
    return result
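
The fallback order in `format_word` is easier to see against a small vocabulary. The sketch below mirrors that order using a plain Python set in place of the loaded model; `resolve_token` and `toy_vocab` are hypothetical names introduced only for this illustration.

```python
def resolve_token(vocab, word):
    """Try the same spellings as format_word, in the same order."""
    joined = "_".join(word.strip().split())    # "New York" -> "New_York"
    if joined in vocab:
        return joined
    if joined.lower() in vocab:                # "KING" -> "king"
        return joined.lower()
    squashed = "".join(word.lower().split())   # "Ice Cream" -> "icecream"
    if squashed in vocab:
        return squashed
    return joined                              # nothing matched; return the first guess

# Toy vocabulary, invented for illustration (not the real GoogleNews vocabulary).
toy_vocab = {"New_York", "king", "icecream"}

for phrase in ["New York", "KING", "Ice Cream", "zzz qqq"]:
    print(phrase, "->", resolve_token(toy_vocab, phrase))
```

Note that the last case returns a token that is not in the vocabulary, so callers still need their own membership check, as `get_word_vector` does.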

Try out some similarities!

In [4]:
print(compute_similarity(model, 'king', 'queen'))
print(compute_similarity(model, 'king', 'chess'))
print(compute_similarity(model, 'king', 'zebra'))
print(compute_similarity(model, 'hot', 'cold'))

0.6510957
0.17920457
0.09835236
0.46021387
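
The numbers above are cosine similarities between the two word vectors. As a minimal sketch of that computation, here is the same measure on made-up 3-dimensional vectors (the GoogleNews vectors have 300 dimensions). The helper `cosine_similarity` and the values in `toy_vectors` are assumptions for illustration, not part of gensim or of the real model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vectors, invented for illustration only (not real word2vec values).
toy_vectors = {
    "king": [1.0, 2.0, 0.0],
    "queen": [1.0, 2.0, 1.0],
    "zebra": [-2.0, 1.0, 3.0],
}

# Vectors pointing in similar directions score near 1; orthogonal vectors score 0.
print(cosine_similarity(toy_vectors["king"], toy_vectors["queen"]))
print(cosine_similarity(toy_vectors["king"], toy_vectors["zebra"]))
```

Because the measure depends only on direction, scaling a vector by a positive constant leaves its similarity to every other vector unchanged.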


Try out some analogies!

In [5]:
analogy('zebra', isto='zoo', as_='dolphin')

[('aquarium', 0.9578245878219604),
 ('Aquarium', 0.9497913718223572),
 ('dolphins', 0.9197266697883606),
 ('Vancouver_Aquarium', 0.9036914110183716),
 ('Oceanarium', 0.8904195427894592),
 ('Zoo', 0.8896881341934204),
 ('Seaworld', 0.8866909742355347),
 ('dolphinariums', 0.8841013312339783),
 ('whale_shark', 0.8743610382080078),
 ('stillborn_calf', 0.8725596070289612)]
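
The scores returned by `analogy` are not plain cosine similarities. gensim documents `most_similar_cosmul` as using a multiplicative combination objective: each cosine is shifted into the range 0 to 1, the similarities to the positive words are multiplied together, and the result is divided by the similarity to the negative words. That is why a score can exceed 1, as it does for `warmer` in the `hot`/`hotter`/`warm` example. The sketch below is an approximation of that scoring under stated assumptions: the vectors are invented, `cosmul_score` is a hypothetical helper, and only the listed candidates are scored, whereas gensim ranks the whole vocabulary.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def cosmul_score(candidate, positive, negative, eps=1e-6):
    """Product of shifted similarities to the positive vectors, divided by
    the product of shifted similarities to the negative vectors."""
    def shifted(u, v):
        return (1.0 + cosine(u, v)) / 2.0  # map cosine from [-1, 1] to [0, 1]
    numerator = 1.0
    for p in positive:
        numerator *= shifted(candidate, p)
    denominator = 1.0
    for n in negative:
        denominator *= shifted(candidate, n)
    return numerator / (denominator + eps)  # eps guards against division by zero

# Toy vectors, invented for illustration: [royal, female, person].
toy = {
    "man": [0.0, 0.0, 1.0],
    "woman": [0.0, 1.0, 1.0],
    "king": [1.0, 0.0, 1.0],
    "queen": [1.0, 1.0, 1.0],
    "prince": [1.0, 0.0, 0.8],
    "girl": [0.0, 1.0, 0.8],
}

# "man is to woman as king is to ?", scored over a hand-picked candidate list.
candidates = ["queen", "prince", "girl"]
ranked = sorted(
    ((w, cosmul_score(toy[w], [toy["woman"], toy["king"]], [toy["man"]])) for w in candidates),
    key=lambda pair: pair[1],
    reverse=True,
)
for word, score in ranked:
    print(word, round(score, 4))
```

In this toy setup `queen` ranks first, and its score is slightly above 1 because it is close to both positive words while being further from the negative one.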

In [6]:
analogy('man', isto='woman', as_='king')

[('queen', 0.9314123392105103),
 ('monarch', 0.858533501625061),
 ('princess', 0.8476566076278687),
 ('Queen_Consort', 0.8150269985198975),
 ('queens', 0.8099815249443054),
 ('crown_prince', 0.8089975714683533),
 ('royal_palace', 0.8027306795120239),
 ('monarchy', 0.801961362361908),
 ('prince', 0.800979733467102),
 ('empress', 0.7958387136459351)]

In [7]:
analogy('man', isto='woman', as_='kanye')

[('beyonce', 0.9388467073440552),
 ('britney', 0.9098939299583435),
 ('chris_brown', 0.9000171422958374),
 ('rihanna', 0.8962053060531616),
 ('miley', 0.8951844573020935),
 ('lady_gaga', 0.8941847085952759),
 ('selena', 0.8923880457878113),
 ('heidi', 0.8865398168563843),
 ('miley_cyrus', 0.8859405517578125),
 ('shes', 0.8803869485855103)]

In [8]:
analogy('woman', isto='man', as_='man')

[('guy', 0.9004976153373718),
 ('boy', 0.8859462141990662),
 ('Man', 0.8823252320289612),
 ('teenager', 0.8430608510971069),
 ('lad', 0.8414479494094849),
 ('suspected_purse_snatcher', 0.8371303081512451),
 ('robber', 0.8346849679946899),
 ('fella', 0.8342578411102295),
 ('dude', 0.832249104976654),
 ('chap', 0.8250261545181274)]

In [9]:
analogy('man', isto='woman', as_='Brad_Pitt')

[('Angelina_Jolie', 0.9859267473220825),
 ('Jolie', 0.9671158194541931),
 ('Angeline_Jolie', 0.9343865513801575),
 ('actress_Angelina_Jolie', 0.9301474094390869),
 ('Jennifer_Aniston', 0.9110976457595825),
 ('Julia_Roberts', 0.9050232768058777),
 ('Nicole_Kidman', 0.8969831466674805),
 ('Brangelina', 0.8947701454162598),
 ('hubby_Brad_Pitt', 0.894501805305481),
 ('Halle_Berry', 0.8915920257568359)]

In [10]:
analogy('man', isto='woman', as_='obama')

[('hillary', 0.8977577686309814),
 ('palin', 0.8898450136184692),
 ('sarah_palin', 0.8604124188423157),
 ('oprah', 0.8574645519256592),
 ('michelle_obama', 0.8549109101295471),
 ('hillary_clinton', 0.8539676070213318),
 ('clinton', 0.8533428311347961),
 ('mccain', 0.8518882989883423),
 ('barack_obama', 0.8460192680358887),
 ('miley_cyrus', 0.8413234949111938)]

In [11]:
analogy('Rome', isto="Italy", as_='Paris')

[('France', 0.9915590286254883),
 ('Belgium', 0.8827090859413147),
 ('Germany', 0.8676550984382629),
 ('Switzerland', 0.8594647645950317),
 ('Villebon_Sur_Yvette', 0.8542975783348083),
 ('extradites_Noriega', 0.8502576947212219),
 ('Tourcoing', 0.8431874513626099),
 ('PARIS_AFX_Gaz_de', 0.8378893733024597),
 ('Dordogne_region', 0.8368163108825684),
 ('Morocco', 0.8367656469345093)]

In [12]:
analogy('Harvard', isto='MIT', as_='Stanford')

[('Caltech', 0.8554999828338623),
 ('UC_Santa_Barbara', 0.8236735463142395),
 ('UCLA', 0.8204423189163208),
 ('Laboratory_SSRL', 0.8178907036781311),
 ('IBM_Almaden', 0.8176031112670898),
 ('SSE_Labs', 0.8147667050361633),
 ('UCSD', 0.8145511746406555),
 ('Karl_Deisseroth', 0.8084021806716919),
 ('Stanford_Synchrotron_Radiation', 0.8060752749443054),
 ('Zhenan_Bao', 0.8060031533241272)]

In [13]:
analogy('hot', isto='cold', as_='warm')

[('chilly', 0.9938065409660339),
 ('warmth', 0.95757657289505),
 ('bitterly_cold', 0.9518202543258667),
 ('colder_temperatures', 0.939226508140564),
 ('warmer', 0.9386539459228516),
 ('chill', 0.938201367855072),
 ('colder', 0.9306686520576477),
 ('wintry_chill', 0.9274750351905823),
 ('Warm', 0.9267772436141968),
 ('chilly_temperatures', 0.9265519976615906)]

In [14]:
analogy('hot', isto='hotter', as_='warm')

[('warmer', 1.0093555450439453),
 ('colder', 0.9644930958747864),
 ('chillier', 0.9083869457244873),
 ('drier', 0.902718186378479),
 ('wetter', 0.895614743232727),
 ('degrees_warmer', 0.8833713531494141),
 ('warmest', 0.8638020753860474),
 ('colder_temperatures', 0.8616779446601868),
 ('cooler_temperatures', 0.861514687538147),
 ('warmer_temperatures', 0.8596215844154358)]

In [15]:
analogy('jump', isto='jumped', as_='go')

[('went', 0.9635381102561951),
 ('gone', 0.9138423800468445),
 ('walked', 0.8547773361206055),
 ('moved', 0.8460192680358887),
 ('goes', 0.8437943458557129),
 ('came', 0.8318214416503906),
 ('stayed', 0.8282606601715088),
 ('ventured', 0.8271411061286926),
 ('ran', 0.8173441886901855),
 ('come', 0.8151555061340332)]

In [16]:
analogy('days', isto='month', as_='years')

[('decade', 0.9610583186149597),
 ('year', 0.95908123254776),
 ('months', 0.8449519872665405),
 ('decades', 0.8329842686653137),
 ('week', 0.8274986147880554),
 ('years.The', 0.7957645058631897),
 ('##/#-year', 0.7813944816589355),
 ('twoyears', 0.7800780534744263),
 ('since####', 0.7741715908050537),
 ('recently', 0.7714582085609436)]