The meaning of words often changes over time.  In this homework, you will explore this phenomenon by identifying shifts in word meaning over roughly one hundred years, comparing word embeddings trained on historical texts (largely published before 1923) with embeddings trained on contemporary texts.

In [1]:
import re
from gensim.models import KeyedVectors
import operator

In [2]:
wiki = KeyedVectors.load_word2vec_format("../data/glove.6B.50d.50K.txt", binary=False)

In [3]:
guten = KeyedVectors.load_word2vec_format("../data/gutenberg.200.vectors.50K.txt", binary=False)

Q1. Before we jump in, select 5 words whose senses you believe have changed over the period of the past 100 years. Ensure they are in the vocabulary of both models.  Explain the two different meanings they have.  This is an important step in stating your beliefs before you examine any empirical evidence; do not change these terms after you have run the models you develop below.  (Here we are only evaluating the rationales, not whether the terms *actually* undergo sense change, as measured below.)

In [4]:
# fill in terms here
terms=["apple", "tank", "mouse", "weed", "album"]
for term in terms:
    if term not in wiki or term not in guten:
        print("%s missing!" % term)

**Q1 response**.

apple: \
past - a fruit \
now - a company

tank: \
past - a container for liquid \
now - an armored military vehicle

mouse: \
past - an animal \
now - a peripheral for a computer

weed: \
past - unwanted plants in a garden or farm \
now - cannabis

album: \
past - a collection of historical records \
now - collection of music or photos

Q2. Find the words that have changed the most by calculating the number of words that overlap in their 50 nearest neighbors.  That is, let $\mathcal{N}_{guten}(\textrm{awesome})$ be the 50 nearest neighbors for the word "awesome" in the Gutenberg embeddings and $\mathcal{N}_{wiki}(\textrm{awesome})$ be the 50 nearest neighbors for "awesome" in the Wikipedia embeddings.  Calculate the size of $\mathcal{N}_{guten}(\textrm{awesome}) \cap \mathcal{N}_{wiki}(\textrm{awesome})$.  Under this method, the words that share the *fewest* neighbors have moved the furthest apart.  Display the 100 words that have moved the furthest apart and the 100 words that have remained the closest together, along with their intersection score.  
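
As a minimal sketch of the overlap score described above, the snippet below uses small hand-made neighbor lists rather than real embeddings. The words in `guten_neighbors` and `wiki_neighbors` are invented for illustration (3 neighbors each instead of 50), and `overlap_score` is a hypothetical helper name, not part of gensim.

```python
def overlap_score(neighbors_a, neighbors_b):
    """Return the number of words shared by two nearest-neighbor lists."""
    return len(set(neighbors_a) & set(neighbors_b))


# Invented toy neighborhoods; real neighborhoods would come from the two models.
guten_neighbors = ["dreadful", "solemn", "terrible"]
wiki_neighbors = ["amazing", "cool", "terrible"]

score = overlap_score(guten_neighbors, wiki_neighbors)
print(score)  # 1: only "terrible" appears in both lists
```

A score of 0 would mean the two neighborhoods share no words at all, which this method treats as the strongest evidence of change.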

Now let's look at how much the candidate terms you defined above have changed their meaning as measured in these embeddings.  First, we can just print their neighborhoods:

In [5]:
def print_top(word):
    print("=== %s ===\n" % word)
    print("Gutenberg:")
    for k, v in guten.most_similar(word, topn=10):
        print("%.3f\t%s" % (v,k))

    print()
    print("Wikipedia:")
    for k, v in wiki.most_similar(word, topn=10):
        print("%.3f\t%s" % (v,k)) 
    print()

In [6]:
for term in terms:
    print_top(term)

=== apple ===

Gutenberg:
0.686	fruit
0.679	apples
0.662	apricot
0.662	onion
0.661	pear
0.656	cabbage
0.656	cherry
0.656	peach
0.648	bread-fruit
0.639	gum

Wikipedia:
0.754	blackberry
0.744	chips
0.743	iphone
0.733	microsoft
0.733	ipad
0.722	pc
0.720	ipod
0.719	intel
0.715	ibm
0.709	software

=== tank ===

Gutenberg:
0.653	hydrant
0.632	cistern
0.625	basin
0.609	radiator
0.591	trough
0.579	pail
0.579	dribbling
0.578	water
0.577	buckets
0.577	refrigerator

Wikipedia:
0.846	tanks
0.747	rocket
0.737	artillery
0.724	fire
0.711	cannon
0.688	launcher
0.684	armored
0.682	armoured
0.676	munitions
0.675	mounted

=== mouse ===

Gutenberg:
0.701	kitten
0.694	cat
0.641	dog
0.589	bird
0.580	caterpillar
0.576	puppy
0.559	butterfly
0.549	hen
0.545	squirrel
0.537	dormouse

Wikipedia:
0.797	monkey
0.781	bugs
0.773	cat
0.762	rabbit
0.750	worm
0.731	clone
0.727	robot
0.720	spider
0.710	bug
0.703	frog

=== weed ===

Gutenberg:
0.598	furze
0.596	nettles
0.593	herb
0.591	grape
0.580	flower
0.579	juice
0.576

**Q2 response**.

In [7]:
def count_overlap(word):
    """Return how many of the word's 50 nearest neighbors are shared by both models."""
    guten_list = [k for k, v in guten.most_similar(word, topn=50)]
    wiki_list = [k for k, v in wiki.most_similar(word, topn=50)]

    return len(set(guten_list) & set(wiki_list))

In [8]:
scores = []

for term in terms:
    scores.append(count_overlap(term))

print(scores)

[1, 2, 9, 5, 0]


In [9]:
# Words present in both vocabularies
common_words = list(set(guten.key_to_index) & set(wiki.key_to_index))

# Drop any token that contains a digit
pattern = r'\d+'
common_words_without_numbers = [s for s in common_words if not re.search(pattern, s)]

len(common_words_without_numbers)

25603

In [10]:
scores = []

for word in common_words_without_numbers:
    scores.append(count_overlap(word))

In [11]:
import pandas as pd

data = {'word': common_words_without_numbers,
        'score': scores}

df = pd.DataFrame(data)

**100 words that have moved the furthest apart (lowest overlap):**

In [12]:
df.sort_values(by=["score"], ascending=True).head(100)

| | word | score |
|---|---|---|
| 25602 | rehabilitation | 0 |
| 7234 | outlandish | 0 |
| 18212 | caledonian | 0 |
| 7231 | missy | 0 |
| 18215 | wetter | 0 |
| ... | ... | ... |
| 18422 | wellington | 0 |
| 18428 | driveway | 0 |
| 7036 | religiously | 0 |
| 7032 | fitch | 0 |


**100 words that have remained the closest together (highest overlap):**

In [13]:
df.sort_values(by=["score"], ascending=False).head(100)

| | word | score |
|---|---|---|
| 18764 | fifteen | 31 |
| 10781 | fourteen | 30 |
| 7705 | kentucky | 29 |
| 1667 | iowa | 29 |
| 8648 | thirteenth | 29 |
| ... | ... | ... |
| 19676 | tenth | 19 |
| 10768 | pie | 19 |
| 10247 | pennsylvania | 19 |
| 2977 | s. | 19 |


**check+**. Let's make this a little more precise.  Rank all terms by the overlap score you created above, so that words with scores closer to 0 (i.e., no overlap in nearest neighbors) are ranked higher (i.e., closer to position 1). Measure how good your guesses were by calculating their [mean reciprocal rank](https://en.wikipedia.org/wiki/Mean_reciprocal_rank) within this list.  (Again, we're not evaluating how good your guesses were above, but rather the correctness of your implementation of MRR.)
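
For reference, with $Q$ the set of guessed terms and $\textrm{rank}_i$ the 1-based position of term $i$ in the ranked list, MRR is $\frac{1}{|Q|}\sum_{i \in Q} \frac{1}{\textrm{rank}_i}$. The following is a minimal sketch on an invented toy ranking; the words and scores are made up, and `mean_reciprocal_rank` is a hypothetical helper name.

```python
def mean_reciprocal_rank(ranked_words, guesses):
    """Average 1/rank of each guess, where rank is the 1-based position in ranked_words."""
    position = {word: i for i, word in enumerate(ranked_words, start=1)}
    return sum(1.0 / position[g] for g in guesses) / len(guesses)


# Invented overlap scores; lower overlap means more change, so sort ascending.
toy_scores = {"cell": 0, "web": 1, "table": 7, "seven": 9}
ranked = sorted(toy_scores, key=toy_scores.get)

mrr = mean_reciprocal_rank(ranked, ["cell", "table"])
print(mrr)  # (1/1 + 1/3) / 2
```

An MRR near 1 would mean the guesses sit at the very top of the ranking; values near 0 mean they sit far down the list.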

In [14]:
df_rank = df.sort_values(by=["score"], ascending=True)
df_rank["rank"] = list(range(1, len(df_rank)+1))
df_rank["rr"] = 1 / df_rank["rank"]
df_rank

| | word | score | rank | rr |
|---|---|---|---|---|
| 25602 | rehabilitation | 0 | 1 | 1.000000 |
| 7234 | outlandish | 0 | 2 | 0.500000 |
| 18212 | caledonian | 0 | 3 | 0.333333 |
| 7231 | missy | 0 | 4 | 0.250000 |
| 18215 | wetter | 0 | 5 | 0.200000 |
| ... | ... | ... | ... | ... |
| 7705 | kentucky | 29 | 25599 | 0.000039 |
| 8648 | thirteenth | 29 | 25600 | 0.000039 |
| 1667 | iowa | 29 | 25601 | 0.000039 |
| 10781 | fourteen | 30 | 25602 | 0.000039 |
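
Many words tie at the same overlap score (every word in the first rows of this output has score 0), so the positional ranks assigned in cell 14 depend on how the sort happens to order tied words. One possible alternative, which is not what cell 14 does, is to give tied words the same rank. The sketch below does this in plain Python with invented scores; `min_ranks` is a hypothetical helper name.

```python
def min_ranks(scores):
    """Map each word to a 1-based rank; tied scores share the lowest rank in the tie."""
    ordered = sorted(scores.items(), key=lambda kv: kv[1])
    ranks = {}
    for position, (word, score) in enumerate(ordered, start=1):
        previous = ordered[position - 2] if position > 1 else None
        if previous is not None and score == previous[1]:
            # Same score as the previous word: reuse its rank.
            ranks[word] = ranks[previous[0]]
        else:
            ranks[word] = position
    return ranks


# Invented scores with ties at 0 and at 5.
toy_scores = {"a": 0, "b": 0, "c": 0, "d": 2, "e": 5, "f": 5}
print(min_ranks(toy_scores))
```

Under this scheme every word with score 0 would receive rank 1, which would raise the reciprocal rank of a guess that ties for the lowest score.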


In [15]:
df_select = df_rank[df_rank["word"].isin(terms)]
df_select

| | word | score | rank | rr |
|---|---|---|---|---|
| 11737 | album | 0 | 3141 | 0.000318 |
| 22595 | apple | 1 | 10397 | 9.6e-05 |
| 7855 | tank | 2 | 13760 | 7.3e-05 |
| 5637 | weed | 5 | 20038 | 5e-05 |
| 21172 | mouse | 9 | 23793 | 4.2e-05 |


In [16]:
MRR = df_select["rr"].sum() / len(df_select)
MRR

0.00011583206074514676