# Nearest Neighbors

When exploring a large set of documents -- such as Wikipedia, news articles, StackOverflow, etc. -- it can be useful to get a list of related material. To find relevant documents you typically
* Decide on a notion of similarity
* Find the documents that are most similar 

In this assignment you will
* Gain intuition for different notions of similarity and practice finding similar documents.
* Explore the tradeoffs between representing documents using raw word counts and using TF-IDF
* Explore the behavior of different distance metrics by looking at the Wikipedia pages most similar to President Obama’s page.

**Note to Amazon EC2 users**: To conserve memory, make sure to stop all the other notebooks before running this notebook.

## Import necessary packages

In [146]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from scipy.sparse import csr_matrix      # sparse matrices
%matplotlib inline

## Load Wikipedia dataset

We will be using the same dataset of Wikipedia pages that we used in the Machine Learning Foundations course (Course 1). Each element of the dataset consists of a link to the Wikipedia article, the name of the person, and the text of the article (in lowercase).

In [147]:
wiki = pd.read_csv('people_wiki.csv')

In [148]:
wiki.head()

Unnamed: 0,URI,name,text
0,<http://dbpedia.org/resource/Digby_Morrell>,Digby Morrell,digby morrell born 10 october 1979 is a former...
1,<http://dbpedia.org/resource/Alfred_J._Lewy>,Alfred J. Lewy,alfred j lewy aka sandy lewy graduated from un...
2,<http://dbpedia.org/resource/Harpdog_Brown>,Harpdog Brown,harpdog brown is a singer and harmonica player...
3,<http://dbpedia.org/resource/Franz_Rottensteiner>,Franz Rottensteiner,franz rottensteiner born in waidmannsfeld lowe...
4,<http://dbpedia.org/resource/G-Enka>,G-Enka,henry krvits born 30 december 1974 in tallinn ...


## Extract word count vectors

For your convenience, we extracted the word count vectors from the dataset. The vectors are packaged in a sparse matrix, where the i-th row gives the word count vector for the i-th document. Each column corresponds to a unique word appearing in the dataset. The mapping between words and integer indices is given in people_wiki_map_index_to_word.json.
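To make the sparse layout concrete, here is a minimal sketch on a hypothetical 2-document, 4-word matrix (the toy values are not taken from the dataset). It shows how the three arrays that the loader below reads (data, indices, indptr) encode the rows of a CSR matrix.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical toy word-count matrix: 2 documents (rows), 4 vocabulary words (columns).
dense = np.array([[0, 2, 0, 1],
                  [3, 0, 0, 0]])
sparse = csr_matrix(dense)

print(sparse.data)     # nonzero counts, stored row by row
print(sparse.indices)  # column (word) index of each nonzero count
print(sparse.indptr)   # row i occupies data[indptr[i]:indptr[i+1]]

# Recover the {word index: count} pairs of document 0 from the raw arrays.
start, end = sparse.indptr[0], sparse.indptr[1]
doc0 = dict(zip(sparse.indices[start:end].tolist(),
                sparse.data[start:end].tolist()))
print(doc0)
```

Only nonzero counts are stored, which is why a vocabulary of several hundred thousand columns still fits in memory.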

To load in the word count vectors, define the function

In [149]:
def load_sparse_csr(filename):
    loader = np.load(filename)
    data = loader['data']
    indices = loader['indices']
    indptr = loader['indptr']
    shape = loader['shape']
    
    return csr_matrix( (data, indices, indptr), shape)

In [150]:
word_count = load_sparse_csr('people_wiki_word_count.npz')

In [151]:
print(word_count[0])

  (0, 5877)	1
  (0, 92219)	1
  (0, 227191)	1
  (0, 446948)	1
  (0, 468870)	1
  (0, 477285)	5
  (0, 492466)	1
  (0, 509506)	1
  (0, 514262)	1
  (0, 523996)	1
  (0, 528953)	1
  (0, 529843)	1
  (0, 533540)	1
  (0, 535034)	3
  (0, 535475)	1
  (0, 538022)	1
  (0, 538168)	1
  (0, 540827)	1
  (0, 541501)	1
  (0, 541760)	1
  (0, 542488)	1
  (0, 542854)	1
  (0, 542859)	1
  (0, 542919)	1
  (0, 543517)	2
  :	:
  (0, 547931)	1
  (0, 547934)	1
  (0, 547935)	1
  (0, 547938)	1
  (0, 547952)	1
  (0, 547956)	1
  (0, 547958)	1
  (0, 547959)	1
  (0, 547960)	1
  (0, 547962)	2
  (0, 547963)	1
  (0, 547964)	3
  (0, 547965)	4
  (0, 547966)	6
  (0, 547967)	5
  (0, 547969)	2
  (0, 547970)	5
  (0, 547971)	4
  (0, 547972)	5
  (0, 547973)	1
  (0, 547974)	4
  (0, 547975)	4
  (0, 547976)	13
  (0, 547977)	4
  (0, 547978)	27


In [152]:
word_count[0].indices

array([  5877,  92219, 227191, 446948, 468870, 477285, 492466, 509506,
       514262, 523996, 528953, 529843, 533540, 535034, 535475, 538022,
       538168, 540827, 541501, 541760, 542488, 542854, 542859, 542919,
       543517, 543802, 544119, 544367, 544602, 544982, 545219, 545515,
       545540, 545588, 545715, 545920, 546322, 546354, 546370, 546421,
       546503, 546518, 546570, 546634, 546639, 546696, 546703, 546719,
       546752, 546775, 546778, 546874, 546949, 547087, 547101, 547194,
       547210, 547260, 547261, 547359, 547478, 547492, 547498, 547536,
       547541, 547550, 547579, 547580, 547628, 547630, 547651, 547662,
       547667, 547674, 547687, 547689, 547705, 547708, 547731, 547745,
       547751, 547759, 547771, 547778, 547780, 547781, 547798, 547808,
       547809, 547825, 547837, 547839, 547843, 547844, 547849, 547856,
       547859, 547860, 547867, 547869, 547874, 547879, 547882, 547887,
       547889, 547899, 547901, 547904, 547910, 547913, 547914, 547916,
      

In [153]:
map_index_to_word = pd.read_json('people_wiki_map_index_to_word.json', typ='series')

## Find nearest neighbors

Let's start by finding the nearest neighbors of the Barack Obama page using the word count vectors to represent the articles and Euclidean distance to measure distance. For this, we will use scikit-learn's implementation of k-nearest neighbors. We first create an instance of the NearestNeighbors class, specifying the model parameters. Then we call the fit() method to attach the training set.

In [154]:
from sklearn.neighbors import NearestNeighbors

model = NearestNeighbors(metric='euclidean', algorithm='brute')
model.fit(word_count)

NearestNeighbors(algorithm='brute', leaf_size=30, metric='euclidean',
         metric_params=None, n_jobs=1, n_neighbors=5, p=2, radius=1.0)
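Before querying the full model, it may help to see the same calls on a small scale. The sketch below uses a hypothetical 4-document corpus (toy_counts and toy_model are illustrative names, not part of the assignment) and shows the shape of what kneighbors returns.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors

# Hypothetical toy corpus: 4 documents (rows), 3 vocabulary words (columns).
toy_counts = csr_matrix(np.array([[1, 0, 2],
                                  [1, 0, 3],
                                  [0, 5, 0],
                                  [4, 4, 4]]))

toy_model = NearestNeighbors(metric='euclidean', algorithm='brute')
toy_model.fit(toy_counts)

# Query with row 0. Both results have one row per query vector,
# with neighbors ordered from nearest to farthest.
toy_distances, toy_indices = toy_model.kneighbors(toy_counts[0], n_neighbors=2)
print(toy_indices)    # the query document itself comes first, at distance 0
print(toy_distances)
```

Because the query vector is itself a row of the fitted matrix, it is always returned as its own nearest neighbor.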

Run the following cell to obtain the row number for Obama's article:

In [155]:
print (wiki[wiki['name'] == 'Barack Obama'])

                                              URI          name  \
35817  <http://dbpedia.org/resource/Barack_Obama>  Barack Obama   

                                                    text  
35817  barack hussein obama ii brk husen bm born augu...  


Let us run the k-nearest neighbor algorithm with Obama's article. Since the NearestNeighbors class expects a vector, we pass row 35817 of the word_count matrix.

In [156]:
distances, indices = model.kneighbors(word_count[35817], n_neighbors=10) # 1st arg: word count vector

The query returns the indices of and distances to the 10 nearest neighbors. To display the indices and distances together with the article name, run

In [157]:
neighbors = pd.DataFrame({'distance':distances[0].tolist(), 'id':indices[0].tolist()})

In [158]:
neighbors.head(2)

Unnamed: 0,distance,id
0,0.0,35817
1,33.075671,24478


In [159]:
wiki.head(2)

Unnamed: 0,URI,name,text
0,<http://dbpedia.org/resource/Digby_Morrell>,Digby Morrell,digby morrell born 10 october 1979 is a former...
1,<http://dbpedia.org/resource/Alfred_J._Lewy>,Alfred J. Lewy,alfred j lewy aka sandy lewy graduated from un...


In [160]:
wiki['id'] = wiki.index
wiki.head(2)

Unnamed: 0,URI,name,text,id
0,<http://dbpedia.org/resource/Digby_Morrell>,Digby Morrell,digby morrell born 10 october 1979 is a former...,0
1,<http://dbpedia.org/resource/Alfred_J._Lewy>,Alfred J. Lewy,alfred j lewy aka sandy lewy graduated from un...,1


In [161]:
print (pd.merge(wiki, neighbors, on='id', how='inner').sort_values(by=['distance'])[['id','name','distance']])

      id                        name   distance
8  35817                Barack Obama   0.000000
4  24478                   Joe Biden  33.075671
5  28447              George W. Bush  34.394767
7  35357            Lawrence Summers  36.152455
2  14754                 Mitt Romney  36.166283
1  13229            Francisco Barrio  36.331804
6  31423              Walter Mondale  36.400549
3  22745  Wynn Normington Hugh-Jones  36.496575
9  36364                  Don Bonker  36.633318
0   9210                Andy Anstett  36.959437


All of the 10 people are politicians, but about half of them have rather tenuous connections with Obama, other than the fact that they are politicians.

* Francisco Barrio is a Mexican politician, and a former governor of Chihuahua.
* Walter Mondale and Don Bonker are Democrats who made their careers in the late 1970s.
* Wynn Normington Hugh-Jones is a former British diplomat and Liberal Party official.
* Andy Anstett is a former politician in Manitoba, Canada.

Nearest neighbors with raw word counts got some things right, showing all politicians in the query result, but missed finer and important details.

For instance, let's find out why Francisco Barrio was considered a close neighbor of Obama.  To do this, let's look at the most frequently used words in each of Barack Obama and Francisco Barrio's pages:

First, run the following cell to obtain the word_count column, which represents the word count vectors in dictionary form. This way, we can quickly recognize words of great importance.

In [162]:
def unpack_dict(matrix, map_index_to_word):
    #table = list(map_index_to_word.sort_values('index')['category'])
    # if you're not using SFrame, replace this line with
    table = sorted(map_index_to_word, key=map_index_to_word.get)
    
    
    data = matrix.data
    indices = matrix.indices
    indptr = matrix.indptr
    
    num_doc = matrix.shape[0]

    return [{k:v for k,v in zip([table[word_id] for word_id in indices[indptr[i]:indptr[i+1]] ],
                                 data[indptr[i]:indptr[i+1]].tolist())} \
               for i in range(num_doc) ]
wiki['word_count'] = unpack_dict(word_count, map_index_to_word)
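The dictionaries stored in the word_count column are keyed by integer column ids (for example 501004) rather than by the words themselves, which is why the tables in this notebook list numbers in the word column. Note that iterating over a pandas Series yields its values, not its index labels. A minimal sketch of an id-to-word lookup, assuming the JSON file maps each word to its column index (that layout is an assumption, and toy_map is a hypothetical stand-in for map_index_to_word):

```python
import pandas as pd

# Hypothetical toy mapping with the assumed layout: word -> column index.
toy_map = pd.Series({'the': 2, 'obama': 0, 'law': 1})

# Sort by column index so that position i holds the word for column i.
index_to_word = toy_map.sort_values().index.tolist()

print(index_to_word)
print(index_to_word[2])   # the word stored in column 2
```

With such a table, an id from any of the outputs can be translated back to a word by plain list indexing.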

In [163]:
wiki.head(10)

Unnamed: 0,URI,name,text,id,word_count
0,<http://dbpedia.org/resource/Digby_Morrell>,Digby Morrell,digby morrell born 10 october 1979 is a former...,0,"{13041: 1, 346847: 1, 316662: 1, 154234: 1, 20..."
1,<http://dbpedia.org/resource/Alfred_J._Lewy>,Alfred J. Lewy,alfred j lewy aka sandy lewy graduated from un...,1,"{373046: 1, 244901: 1, 521474: 1, 135790: 1, 2..."
2,<http://dbpedia.org/resource/Harpdog_Brown>,Harpdog Brown,harpdog brown is a singer and harmonica player...,2,"{94632: 1, 118056: 1, 305696: 2, 486964: 1, 23..."
3,<http://dbpedia.org/resource/Franz_Rottensteiner>,Franz Rottensteiner,franz rottensteiner born in waidmannsfeld lowe...,3,"{400212: 1, 193551: 1, 55880: 1, 144727: 1, 36..."
4,<http://dbpedia.org/resource/G-Enka>,G-Enka,henry krvits born 30 december 1974 in tallinn ...,4,"{341131: 3, 236658: 1, 4797: 1, 111298: 1, 546..."
5,<http://dbpedia.org/resource/Sam_Henderson>,Sam Henderson,sam henderson born october 18 1969 is an ameri...,5,"{347747: 1, 167474: 1, 434420: 1, 208249: 3, 2..."
6,<http://dbpedia.org/resource/Aaron_LaCrate>,Aaron LaCrate,aaron lacrate is an american music producer re...,6,"{182213: 1, 49614: 1, 257366: 1, 94693: 1, 301..."
7,<http://dbpedia.org/resource/Trevor_Ferguson>,Trevor Ferguson,trevor ferguson aka john farrow born 11 novemb...,7,"{173648: 1, 210370: 1, 247036: 1, 27823: 1, 36..."
8,<http://dbpedia.org/resource/Grant_Nelson>,Grant Nelson,grant nelson born 27 april 1971 in london also...,8,"{380652: 1, 457074: 1, 429264: 3, 98405: 1, 40..."
9,<http://dbpedia.org/resource/Cathy_Caruth>,Cathy Caruth,cathy caruth born 1955 is frank h t rhodes pro...,9,"{130273: 1, 256088: 1, 242060: 1, 518176: 1, 2..."


To make things even easier, we provide a utility function that displays a dictionary in tabular form:

In [164]:
def top_words(name): # modified for pandas 0.21.0 python 2.7.11
    """
    Get a table of the most frequent words in the given person's wikipedia page.
    """
    row = wiki[wiki['name'] == name]
    
    #new code
    word_count_table = pd.DataFrame.from_dict(row['word_count'].values[0], orient='index').reset_index(inplace=False)
    word_count_table.rename(columns = {'index':'word', 0:'count'}, inplace=True)
    return word_count_table.sort_values('count', ascending=False)

obama_words = top_words('Barack Obama')
print (obama_words)


       word  count
272  501004     40
270   50472     30
271   65732     21
269  167945     18
266  378099     14
258  340234     11
71   155572      9
138  363790      8
260  136295      7
268  432838      7
191  203182      6
263   18945      6
212  310952      6
264  518633      5
265  496825      4
97   245696      4
166  187700      4
129  273043      4
155  359042      4
248  246709      4
217  316643      4
254  525602      4
252  294580      3
256   58401      3
216  429439      3
221  173789      3
222  199295      3
214  313526      3
165  152696      3
261    2056      3
..      ...    ...
83   302572      1
66   537165      1
67   512689      1
68   164153      1
70   487105      1
72   289067      1
73   153941      1
74   149896      1
75   492953      1
76   478081      1
77   504875      1
78   533815      1
79    57919      1
80   464847      1
84    56032      1
102  319785      1
85   269273      1
86    62098      1
87   240912      1
88   344401      1
89   301884 

In [165]:
barrio_words = top_words('Francisco Barrio')
print (barrio_words)    

       word  count
224  501004     36
221  167945     24
223   65732     18
222   50472     17
212  136295     10
218  378099      9
19   120546      7
220  432838      6
111   24050      6
210  340234      5
215   18945      5
216  518633      4
178  197712      4
208   58401      4
199  176053      3
107   86949      3
193  206529      3
201  143888      3
204  294580      3
155  176906      3
217  496825      3
20   450426      2
195  227358      2
96   273043      2
219  136146      2
135  244402      2
21   236317      2
145  127337      2
186   96304      2
187  260793      2
..      ...    ...
114  157241      1
115  199543      1
89   174551      1
87    93490      1
61   528162      1
86   346919      1
62   393878      1
63   355902      1
64   295606      1
65    34046      1
66    89312      1
67   342730      1
68    25971      1
69   151681      1
70   256349      1
71   346002      1
72   384788      1
73   337763      1
75   112269      1
76    57843      1
77    44995 

Let's extract the list of most frequent words that appear in both Obama's and Barrio's documents. We've so far sorted all words from Obama and Barrio's articles by their word frequencies. We will now use a dataframe operation known as join. The join operation is very useful when it comes to playing around with data: it lets you combine the content of two tables using a shared column (in this case, the word column). See the documentation for more details.

In [166]:
combined_words = pd.merge(obama_words, barrio_words, on='word', how='inner')
combined_words

Unnamed: 0,word,count_x,count_y
0,501004,40,36
1,50472,30,17
2,65732,21,18
3,167945,18,24
4,378099,14,9
5,340234,11,5
6,136295,7,10
7,432838,7,6
8,18945,6,5
9,518633,5,4


Since both tables contained a column named count, pandas automatically appended the suffixes _x and _y to tell them apart. Let's rename the columns to make clear which one is which. By inspection, we see that the first column (count_x) is for Obama and the second (count_y) is for Barrio.
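As a small illustration of how pandas handles a column name shared by both sides of a join, the sketch below merges two hypothetical tables (left, right, and their values are made up). It also shows the suffixes parameter of pd.merge, which is one way to avoid the separate rename step.

```python
import pandas as pd

# Hypothetical toy tables sharing the key column 'word' and a 'count' column.
left = pd.DataFrame({'word': ['the', 'in', 'law'], 'count': [40, 30, 6]})
right = pd.DataFrame({'word': ['the', 'in', 'party'], 'count': [36, 17, 4]})

# Default behavior: overlapping non-key columns get the suffixes _x and _y.
default_join = pd.merge(left, right, on='word', how='inner')
print(default_join.columns.tolist())

# The suffixes parameter lets you choose more descriptive endings.
named_join = pd.merge(left, right, on='word', how='inner',
                      suffixes=('_obama', '_barrio'))
print(named_join.columns.tolist())
```

An inner join keeps only the words present in both tables, so 'law' and 'party' drop out of the result.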

In [167]:
combined_words = combined_words.rename(columns = {'count_x':'Obama', 'count_y':'Barrio'}, inplace = False)

In [168]:
combined_words

Unnamed: 0,word,Obama,Barrio
0,501004,40,36
1,50472,30,17
2,65732,21,18
3,167945,18,24
4,378099,14,9
5,340234,11,5
6,136295,7,10
7,432838,7,6
8,18945,6,5
9,518633,5,4


In [169]:
combined_words.sort_values(by=['Obama'], ascending=False)

Unnamed: 0,word,Obama,Barrio
0,501004,40,36
1,50472,30,17
2,65732,21,18
3,167945,18,24
4,378099,14,9
5,340234,11,5
6,136295,7,10
7,432838,7,6
8,18945,6,5
9,518633,5,4


In [170]:
combined_words.sort_values(by=['Obama'], ascending=False)['word'][:4]

0    501004
1     50472
2     65732
3    167945
Name: word, dtype: int64

Quiz Question. Among the words that appear in both Barack Obama and Francisco Barrio, take the 5 that appear most frequently in Obama. How many of the articles in the Wikipedia dataset contain all of those 5 words?

Hint:

* Refer to the previous paragraph for finding the words that appear in both articles. Sort the common words by their frequencies in Obama's article and take the largest five.
* Each word count vector is a Python dictionary. For each word count vector in the DataFrame, check whether the set of the 5 common words is a subset of the keys of the word count vector. Complete the function has_top_words to accomplish the task.
    * Convert the list of top 5 words into a set using the syntax "set(common_words)", where common_words is a Python list. See the Python documentation on sets if you're curious.
    * Extract the list of keys of the word count dictionary by calling the keys() method.
    * Convert the list of keys into a set as well.
    * Use the issubset() method to check if all 5 words are among the keys.
* Now apply the has_top_words function to every row of the DataFrame.
* Compute the sum of the result column to obtain the number of articles containing all of the 5 top words.
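The set-based check described in the hints can be sketched on toy data as follows. The dictionaries and the helper name contains_all are hypothetical and only illustrate the set() and issubset() calls.

```python
# Hypothetical word-count dictionaries keyed by word.
doc_a = {'the': 12, 'in': 7, 'and': 5, 'senate': 1}
doc_b = {'the': 3, 'guitar': 2}

required_words = set(['the', 'in', 'and'])

def contains_all(word_count_vector, required):
    # True when every required word is a key of the dictionary.
    return required.issubset(set(word_count_vector.keys()))

results = [contains_all(doc, required_words) for doc in (doc_a, doc_b)]
print(results)
print(sum(results))   # True counts as 1, so the sum is the number of matches
```

Membership tests on a set take roughly constant time, which matters when the check runs once per article.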

In [171]:
common_words = combined_words.sort_values(by=['Obama'], ascending=False)['word'][:5]  # YOUR CODE HERE

def has_top_words(word_count_vector):
    # extract the keys of word_count_vector and convert them to a set
    unique_words = set(word_count_vector.keys())   # YOUR CODE HERE
    # return True if common_words is a subset of unique_words
    # return False otherwise
    return set(common_words).issubset(unique_words)  # YOUR CODE HERE

wiki['has_top_words'] = wiki['word_count'].apply(has_top_words)

# use has_top_words column to answer the quiz question
np.sum(wiki['has_top_words']) # YOUR CODE HERE

56066

In [172]:
wiki.head()

Unnamed: 0,URI,name,text,id,word_count,has_top_words
0,<http://dbpedia.org/resource/Digby_Morrell>,Digby Morrell,digby morrell born 10 october 1979 is a former...,0,"{13041: 1, 346847: 1, 316662: 1, 154234: 1, 20...",True
1,<http://dbpedia.org/resource/Alfred_J._Lewy>,Alfred J. Lewy,alfred j lewy aka sandy lewy graduated from un...,1,"{373046: 1, 244901: 1, 521474: 1, 135790: 1, 2...",True
2,<http://dbpedia.org/resource/Harpdog_Brown>,Harpdog Brown,harpdog brown is a singer and harmonica player...,2,"{94632: 1, 118056: 1, 305696: 2, 486964: 1, 23...",True
3,<http://dbpedia.org/resource/Franz_Rottensteiner>,Franz Rottensteiner,franz rottensteiner born in waidmannsfeld lowe...,3,"{400212: 1, 193551: 1, 55880: 1, 144727: 1, 36...",True
4,<http://dbpedia.org/resource/G-Enka>,G-Enka,henry krvits born 30 december 1974 in tallinn ...,4,"{341131: 3, 236658: 1, 4797: 1, 111298: 1, 546...",False


In [173]:
combined_words.head()

Unnamed: 0,word,Obama,Barrio
0,501004,40,36
1,50472,30,17
2,65732,21,18
3,167945,18,24
4,378099,14,9


In [174]:
print('Answer:56066')

Answer:56066


Checkpoint. Check your has_top_words function on two random articles:

In [175]:
print ('Output from your function:', has_top_words(wiki.iloc[32]['word_count']))
print ('Correct output: True')
print ('Also check the length of unique_words. It should be 167')

print ('Output from your function:', has_top_words(wiki.iloc[33]['word_count']))
print ('Correct output: False')
print ('Also check the length of unique_words. It should be 188')

Output from your function: True
Correct output: True
Also check the length of unique_words. It should be 167
Output from your function: False
Correct output: False
Also check the length of unique_words. It should be 188


Quiz Question. Measure the pairwise distance between the Wikipedia pages of Barack Obama, George W. Bush, and Joe Biden. Which of the three pairs has the smallest distance?

Hint: For this question, take the row vectors from the word count matrix that correspond to Obama, Bush, and Biden. To compute the Euclidean distance between any two sparse vectors, use sklearn.metrics.pairwise.euclidean_distances.
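As a quick illustration of the function named in the hint, the sketch below computes all pairwise distances among three hypothetical sparse vectors in one call and then picks the closest pair. The labels A, B, C and the values are made up.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.metrics.pairwise import euclidean_distances

# Hypothetical toy vectors, one per row.
labels = ['A', 'B', 'C']
toy = csr_matrix(np.array([[3, 0, 4],
                           [0, 0, 0],
                           [3, 0, 0]]))

# With a single argument, every row is compared against every other row.
pairwise = euclidean_distances(toy)
print(pairwise)

# Look only above the diagonal so that each pair is considered once.
rows, cols = np.triu_indices(len(labels), k=1)
best = np.argmin(pairwise[rows, cols])
closest_pair = (labels[rows[best]], labels[cols[best]])
print(closest_pair)
```

The result is a symmetric matrix with zeros on the diagonal, so entry (i, j) is the distance between rows i and j.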

In [176]:
from sklearn.metrics.pairwise import euclidean_distances

In [177]:
row_x = wiki[wiki['name'] == 'Barack Obama'].index
row_y = wiki[wiki['name'] == 'George W. Bush'].index
row_z = wiki[wiki['name'] == 'Joe Biden'].index

In [178]:
x=word_count[row_x]
y=word_count[row_y]
z=word_count[row_z]

In [179]:
euclidean_distances(x, y)

array([[ 34.39476704]])

In [180]:
euclidean_distances(x, z)

array([[ 33.07567082]])

In [181]:
euclidean_distances(y, z)

array([[ 32.75667871]])

In [182]:
print('Answer: Bush and Biden')

Answer: Bush and Biden


Quiz Question. Collect all words that appear both in Barack Obama and George W. Bush pages. Out of those words, find the 10 words that show up most often in Obama's page.

Note. Even though common words are swamping out important subtle differences, commonalities in rarer political words still matter on the margin. This is why politicians are being listed in the query result instead of musicians, for example. In the next subsection, we will introduce a different metric that will place greater emphasis on those rarer words.
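A small numeric sketch may make this note more tangible. The counts below are hypothetical and use a three-word vocabulary; they are not taken from the dataset.

```python
import numpy as np

# Hypothetical counts for the words ['the', 'senate', 'guitar'].
politician_a = np.array([40, 3, 0])
politician_b = np.array([36, 2, 0])
musician = np.array([38, 0, 3])

def euclidean(u, v):
    # Square root of the sum of squared coordinate differences.
    return np.sqrt(np.sum((u - v) ** 2))

# Share of the squared distance between the two politicians that comes from 'the'.
sq_diff = (politician_a - politician_b) ** 2
share_from_the = sq_diff[0] / float(sq_diff.sum())

dist_politicians = euclidean(politician_a, politician_b)
dist_to_musician = euclidean(politician_a, musician)
print(dist_politicians, dist_to_musician, share_from_the)
```

In this toy setup most of the distance between the two politicians comes from the common word, yet the rarer words are still what places the second politician closer than the musician.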

In [183]:
bush_words = top_words('George W. Bush') 
combined_words_obama_bush = pd.merge(obama_words, bush_words, on='word', how='inner')
combined_words_obama_bush = combined_words_obama_bush.rename(columns = {'count_x':'Obama', 'count_y':'Bush'}, inplace = False)

In [184]:
combined_words_obama_bush.sort_values(by=['Obama'], ascending=False)['word'][:10]

0    501004
1     50472
2     65732
3    167945
4    378099
5    340234
6    363790
7    136295
8    432838
9    203182
Name: word, dtype: int64

In [185]:
print('Answer:see above')

Answer:see above


## Extract the TF-IDF vectors

Much of the perceived commonalities between Obama and Barrio were due to occurrences of extremely frequent words, such as "the", "and", and "his". So nearest neighbors is recommending plausible results sometimes for the wrong reasons. 

To retrieve articles that are more relevant, we should focus more on rare words that don't happen in every article. **TF-IDF** (term frequency–inverse document frequency) is a feature representation that penalizes words that are too common.  Let us load in the TF-IDF vectors and repeat the nearest neighbor search.
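To illustrate how TF-IDF penalizes common words, here is a minimal sketch on hypothetical counts. It assumes one common variant of the weighting, count × log(N / df), where N is the number of documents and df is the number of documents containing the word; the exact formula used to build the provided TF-IDF file is not stated here.

```python
import numpy as np

# Hypothetical word-count matrix: 3 documents (rows), 3 words (columns).
# Column 0 plays the role of a stopword that appears in every document.
counts = np.array([[5, 1, 0],
                   [4, 0, 2],
                   [6, 0, 0]], dtype=float)

num_docs = float(counts.shape[0])
doc_freq = (counts > 0).sum(axis=0)   # number of documents containing each word
idf = np.log(num_docs / doc_freq)     # assumed variant: log(N / df)
tf_idf_toy = counts * idf             # scale each column by its IDF weight

print(idf)
print(tf_idf_toy)
```

A word that occurs in every document gets an IDF of log(1) = 0, so its column vanishes no matter how large the raw counts are, while words confined to a single document keep a positive weight.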

For your convenience, we extracted the TF-IDF vectors from the dataset. The vectors are packaged in a sparse matrix, where the i-th row gives the TF-IDF vector for the i-th document. Each column corresponds to a unique word appearing in the dataset. The mapping between words and integer indices is given in people_wiki_map_index_to_word.json.


In [186]:
tf_idf = load_sparse_csr('people_wiki_tf_idf.npz')

In addition to the sparse matrix, we also store the TF-IDF vectors in dictionary form, to allow for easy interpretation.

In [192]:
wiki['tf_idf'] = unpack_dict(tf_idf, map_index_to_word)

In [193]:
wiki.iloc[35817]

URI                     <http://dbpedia.org/resource/Barack_Obama>
name                                                  Barack Obama
text             barack hussein obama ii brk husen bm born augu...
id                                                           35817
word_count       {198842: 1, 305469: 1, 160223: 1, 162899: 1, 4...
has_top_words                                                 True
tf_idf           {198842: 10.986495389225194, 305469: 10.986495...
Name: 35817, dtype: object

## Find nearest neighbors using TF-IDF vectors

In [136]:
model_tf_idf = NearestNeighbors(metric='euclidean', algorithm='brute')
model_tf_idf.fit(tf_idf)

NearestNeighbors(algorithm='brute', leaf_size=30, metric='euclidean',
         metric_params=None, n_jobs=1, n_neighbors=5, p=2, radius=1.0)

Perform the nearest neighbor search by running

In [137]:
distances, indices = model_tf_idf.kneighbors(tf_idf[35817], n_neighbors=10)

In [138]:
distances

array([[   0.        ,  106.86101369,  108.87167422,  109.04569791,
         109.10810617,  109.78186711,  109.95778808,  110.41388872,
         110.4706087 ,  110.696998  ]])

To print the names of the articles, we perform join using the indices:

In [143]:
neighbors = pd.DataFrame({'distance':distances[0].tolist(), 'id':indices[0].tolist()})
print (pd.merge(wiki, neighbors, on='id', how='inner').sort_values(by=['distance'])[['id','name','distance']])

      id                     name    distance
3  35817             Barack Obama    0.000000
1   7914            Phil Schiliro  106.861014
9  46811            Jeff Sessions  108.871674
7  44681   Jesse Lee (politician)  109.045698
4  38376           Samantha Power  109.108106
0   6507             Bob Menendez  109.781867
5  38714  Eric Stern (politician)  109.957788
8  44825           James A. Guest  110.413889
6  44368     Roland Grossenbacher  110.470609
2  33417            Tulsi Gabbard  110.696998


Let's determine whether this list makes sense.

With the notable exception of Roland Grossenbacher, the other 8 are all American politicians who are contemporaries of Barack Obama.

Phil Schiliro, Jesse Lee, Samantha Power, and Eric Stern worked for Obama.

Clearly, the results are more plausible with the use of TF-IDF. Let's take a look at the word vectors for Obama's and Schiliro's pages. Notice that the TF-IDF representation assigns a weight to each word. This weight captures the relative importance of that word in the document. Let us sort the words in Obama's article by their TF-IDF weights; we do the same for Schiliro's article as well.

In [145]:
def top_words_tf_idf(name):
    row = wiki[wiki['name'] == name]
    #word_count_table = row[['tf_idf']].stack('tf_idf', new_column_name=['word','weight'])
    #return word_count_table.sort('weight', ascending=False)
    #new code
    word_count_table = pd.DataFrame.from_dict(row['tf_idf'].values[0], orient='index').reset_index(inplace=False)
    word_count_table.rename(columns = {'index':'word', 0:'weight'}, inplace=True)
    return word_count_table.sort_values('weight', ascending=False)

obama_tf_idf = top_words_tf_idf('Barack Obama')
print (obama_tf_idf)

schiliro_tf_idf = top_words_tf_idf('Phil Schiliro')
print (schiliro_tf_idf)

       word     weight
71   155572  43.295653
138  363790  27.678223
97   245696  17.747379
129  273043  14.887061
191  203182  14.722936
69   417385  14.533374
155  359042  13.115933
105  536997  12.784385
104   61664  12.784385
166  187700  12.410689
212  310952  11.591943
1    305469  10.986495
0    198842  10.986495
2    160223  10.986495
3    162899  10.293348
4      4399  10.293348
143   36843  10.164288
5    270913   9.887883
81   318996   9.431014
82   402727   9.419704
165  152696   9.319342
171  196163   9.077468
6    410930   9.040585
95   108674   8.967411
7    156826   8.907054
98   344074   8.842461
101   72107   8.698475
8      5731   8.421546
110  318918   8.281231
189   13602   7.712676
..      ...        ...
234  463058   1.496782
260  136295   1.493580
235  470632   1.491503
236  199294   1.487973
237   27072   1.487823
238   67570   1.442401
239  226481   1.430935
240  513620   1.383640
245  181146   1.098883
246   43824   1.089076
247  143888   1.075238
249  403051

Using the join operation we learned earlier, try your hand at computing the common words shared by Obama's and Schiliro's articles. Sort the common words by their TF-IDF weights in Obama's document. The first 10 words should say: Obama, law, democratic, Senate, presidential, president, policy, states, office, 2011.