# Document retrieval from Wikipedia data

# Fire up GraphLab Create

In [1]:
import graphlab

# Load some text data: Wikipedia pages on people

In [2]:
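import requests, zipfile, io

# Download the zipped SFrame and unpack it into the working directory
r = requests.get('https://s3-us-west-1.amazonaws.com/storagebucketmachinelearning/people_wiki.gl.zip')
r.raise_for_status()  # stop early if the download failed
z = zipfile.ZipFile(io.BytesIO(r.content))
z.extractall()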

In [3]:
people = graphlab.SFrame('people_wiki.gl/')

[INFO] graphlab.cython.cy_server: GraphLab Create v2.1 started. Logging: /tmp/graphlab_server_1540368102.log




The data contains a link to the Wikipedia article, the name of the person, and the text of the article.

In [4]:
people.head()

URI,name,text
<http://dbpedia.org/resource/Digby_Morrell> ...,Digby Morrell,digby morrell born 10 october 1979 is a former ...
<http://dbpedia.org/resource/Alfred_J._Lewy> ...,Alfred J. Lewy,alfred j lewy aka sandy lewy graduated from ...
<http://dbpedia.org/resource/Harpdog_Brown> ...,Harpdog Brown,harpdog brown is a singer and harmonica player who ...
<http://dbpedia.org/resource/Franz_Rottensteiner> ...,Franz Rottensteiner,franz rottensteiner born in waidmannsfeld lower ...
<http://dbpedia.org/resource/G-Enka> ...,G-Enka,henry krvits born 30 december 1974 in tallinn ...
<http://dbpedia.org/resource/Sam_Henderson> ...,Sam Henderson,sam henderson born october 18 1969 is an ...
<http://dbpedia.org/resource/Aaron_LaCrate> ...,Aaron LaCrate,aaron lacrate is an american music producer ...
<http://dbpedia.org/resource/Trevor_Ferguson> ...,Trevor Ferguson,trevor ferguson aka john farrow born 11 november ...
<http://dbpedia.org/resource/Grant_Nelson> ...,Grant Nelson,grant nelson born 27 april 1971 in london ...
<http://dbpedia.org/resource/Cathy_Caruth> ...,Cathy Caruth,cathy caruth born 1955 is frank h t rhodes ...


In [5]:
len(people)

59071

In [6]:
people.shape

(59071, 3)

# Explore the dataset and check out the text it contains

## Exploring the entry for President Obama

For now, select only the article about Obama using logical filtering: build a boolean filter on the 'name' column and apply it to the SFrame.
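
As a rough illustration of what logical filtering does, here is a minimal sketch in plain Python. It does not use GraphLab: the `rows` list is a hypothetical miniature of the `people` table, with text snippets taken from the previews in this notebook.

```python
# Hypothetical miniature of the `people` table: a list of row dicts.
rows = [
    {"name": "Digby Morrell", "text": "digby morrell born 10 october 1979 is a former"},
    {"name": "Barack Obama", "text": "barack hussein obama ii brk husen bm born august"},
    {"name": "George Clooney", "text": "george timothy clooney born may 6 1961"},
]

# Step 1: the comparison yields one True/False value per row (the filter).
mask = [row["name"] == "Barack Obama" for row in rows]

# Step 2: keep only the rows whose filter value is True.
obama_rows = [row for row, keep in zip(rows, mask) if keep]

print(mask)
print(obama_rows[0]["name"])
```

The expression `people['name'] == 'Barack Obama'` plays the role of `mask`, and indexing `people[...]` with it plays the role of the second step.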

In [11]:
obama = people[people['name'] == 'Barack Obama']

In [12]:
obama

URI,name,text
<http://dbpedia.org/resource/Barack_Obama> ...,Barack Obama,barack hussein obama ii brk husen bm born august ...


In [13]:
obama['text']  # see what the text looks like

dtype: str
Rows: ?
['barack hussein obama ii brk husen bm born august 4 1961 is the 44th and current president of the united states and the first african american to hold the office born in honolulu hawaii obama is a graduate of columbia university and harvard law school where he served as president of the harvard law review he was a community organizer in chicago before earning his law degree he worked as a civil rights attorney and taught constitutional law at the university of chicago law school from 1992 to 2004 he served three terms representing the 13th district in the illinois senate from 1997 to 2004 running unsuccessfully for the united states house of representatives in 2000in 2004 obama received national attention during his campaign to represent illinois in the united states senate with his victory in the march democratic party primary his keynote address at the democratic national convention in july and his election to the senate in november he began his presidential campa

## Exploring the entry for actor George Clooney

Do the same for George Clooney: filter the SFrame and store the result in a variable.

In [15]:
clooney = people[people['name'] == 'George Clooney']
clooney['text']

dtype: str
Rows: ?
['george timothy clooney born may 6 1961 is an american actor writer producer director and activist he has received three golden globe awards for his work as an actor and two academy awards one for acting and the other for producingclooney made his acting debut on television in 1978 and later gained wide recognition in his role as dr doug ross on the longrunning medical drama er from 1994 to 1999 for which he received two emmy award nominations while working on er he began attracting a variety of leading roles in films including the superhero film batman robin 1997 and the crime comedy out of sight 1998 in which he first worked with a director who would become a longtime collaborator steven soderbergh in 1999 clooney took the lead role in three kings a wellreceived war satire set during the gulf warin 2001 clooneys fame widened with the release of his biggest commercial success the heist comedy oceans eleven the first of the film trilogy a remake of the 1960 film wit

# Get the word counts for Obama article

Here we use the **text_analytics.count_words()** method, the same as in sentiment analysis, to turn each text into a dictionary that maps every unique word to the number of times it occurs.<br>
We pass in obama['text'] and store the result in a new column, 'word_count'.
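
To make the idea concrete, here is a minimal sketch of word counting in plain Python. It assumes the text is already lowercased and stripped of punctuation (as the article text above appears to be) and splits on whitespace only; GraphLab's own tokenization may differ in details. The helper name `count_words` is reused here purely for illustration.

```python
from collections import Counter


def count_words(text):
    """Return a dict mapping each whitespace-separated token to its count."""
    return dict(Counter(text.split()))


# A short excerpt from the Obama article shown above.
sample = ("he was a community organizer in chicago before earning his law degree "
          "he worked as a civil rights attorney")

counts = count_words(sample)
print(counts["he"], counts["a"], counts["law"])
```

Each key appears once in the dictionary no matter how often the word occurs; the repetition is recorded in the value.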

In [17]:
obama['word_count'] = graphlab.text_analytics.count_words(obama['text'])

In [18]:
print obama['word_count']

[{'operations': 1, 'represent': 1, 'office': 2, 'unemployment': 1, 'is': 2, 'doddfrank': 1, 'over': 1, 'unconstitutional': 1, 'domestic': 2, 'named': 1, 'ending': 1, 'ended': 1, 'proposition': 1, 'seats': 1, 'graduate': 1, 'worked': 1, 'before': 1, 'death': 1, '20': 2, 'taxpayer': 1, 'inaugurated': 1, 'obamacare': 1, 'civil': 1, 'mccain': 1, 'to': 14, '4': 1, 'policy': 2, '8': 1, 'has': 4, '2011': 3, '2010': 2, '2013': 1, '2012': 1, 'bin': 1, 'then': 1, 'his': 11, 'march': 1, 'gains': 1, 'cuba': 1, 'californias': 1, '1992': 1, 'new': 1, 'not': 1, 'during': 2, 'years': 1, 'continued': 1, 'presidential': 2, 'husen': 1, 'osama': 1, 'term': 3, 'equality': 1, 'prize': 1, 'lost': 1, 'stimulus': 1, 'january': 3, 'university': 2, 'rights': 1, 'gun': 1, 'republican': 2, 'rodham': 1, 'troop': 1, 'withdrawal': 1, 'involvement': 3, 'response': 3, 'where': 1, 'referred': 1, 'affordable': 1, 'attorney': 1, 'school': 3, 'senate': 3, 'house': 2, 'national': 2, 'creation': 1, 'related': 1, 'hawaii': 1,

## Sort the word counts for the Obama article

### Turning the dictionary of word counts into a table

Here we use GraphLab's `SFrame.stack`. For comparison, pandas has a method with the same name, DataFrame.stack(level=-1, dropna=True), but it does something different:<br>

**Pivot a level of the (possibly hierarchical) column labels**, returning a DataFrame (or Series in the case of an object with a single level of column labels) having a hierarchical index with a new inner-most level of row labels. The level involved will automatically get sorted.

https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.stack.html

`SFrame.stack`, by contrast, unpacks each dictionary in a column into one row per key/value pair. Below we **split the 'word_count' column in two** and name the new columns 'word' and 'count'.

In [19]:
obama_word_count_table = obama[['word_count']].stack('word_count', new_column_name = ['word','count'])

In [21]:
obama_word_count_new = obama[['word_count']].stack('word_count')

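The second call omits `new_column_name`, so the key and value columns get the default names X1 and X2 (see the output of In [25] below); the content is the same.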
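
The effect of stacking a dictionary column can be sketched in plain Python. This is only an illustration, not GraphLab's implementation: `stack_dict` is a hypothetical helper, and the four counts are copied from the Obama outputs in this notebook.

```python
def stack_dict(word_count, names=("X1", "X2")):
    """Turn one dict into a list of rows, one row per key/value pair."""
    key_name, value_name = names
    return [{key_name: key, value_name: value} for key, value in word_count.items()]


# A few counts taken from the Obama article.
counts = {"cuba": 1, "obama": 9, "the": 40, "in": 30}

table = stack_dict(counts, names=("word", "count"))

# Sorting by the count column, largest first, mirrors sort('count', ascending=False).
top = sorted(table, key=lambda row: row["count"], reverse=True)
print(top[0])
```

Calling `stack_dict(counts)` without `names` yields the same rows under the default column names.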

### Sorting the word counts to show most common words at the top

In [24]:
obama_word_count_table.head()

word,count
cuba,1
relations,1
sought,1
combat,1
ending,1
withdrawal,1
state,1
islamic,1
by,1
gains,1


In [25]:
obama_word_count_new.head()

X1,X2
cuba,1
relations,1
sought,1
combat,1
ending,1
withdrawal,1
state,1
islamic,1
by,1
gains,1


In [26]:
obama_word_count_new.sort('X2',ascending=False)

X1,X2
the,40
in,30
and,21
of,18
to,14
his,11
obama,9
act,8
he,7
a,7


In [27]:
obama_word_count_table.sort('count',ascending=False)

word,count
the,40
in,30
and,21
of,18
to,14
his,11
obama,9
act,8
he,7
a,7


Here we can see the most frequent words. The most common ones are uninformative words like "the", "in", and "and".

# Compute TF-IDF for the corpus 

To give more weight to informative words, we weigh them by their TF-IDF scores.

GraphLab provides a method for this. For a walkthrough of **how TF-IDF works in Python**, see: https://medium.freecodecamp.org/how-to-process-textual-data-using-tf-idf-in-python-cd2bbc0a94a3

**We can't compute TF-IDF for the Obama article in isolation**, because TF-IDF depends on the **entire corpus**. The IDF part needs a normalizer: the **number of articles in which each word appears**.<br>
So we have to compute it over the entire corpus. 

Let's start again with text_analytics.count_words(), this time for the text of every article in the people SFrame. 

In [29]:
people['word_count'] = graphlab.text_analytics.count_words(people['text'])
people.head()

URI,name,text,word_count
<http://dbpedia.org/resource/Digby_Morrell> ...,Digby Morrell,digby morrell born 10 october 1979 is a former ...,"{'selection': 1, 'carltons': 1, 'being': ..."
<http://dbpedia.org/resource/Alfred_J._Lewy> ...,Alfred J. Lewy,alfred j lewy aka sandy lewy graduated from ...,"{'precise': 1, 'thomas': 1, 'closely': 1, ..."
<http://dbpedia.org/resource/Harpdog_Brown> ...,Harpdog Brown,harpdog brown is a singer and harmonica player who ...,"{'just': 1, 'issued': 1, 'mainly': 1, 'nominat ..."
<http://dbpedia.org/resource/Franz_Rottensteiner> ...,Franz Rottensteiner,franz rottensteiner born in waidmannsfeld lower ...,"{'all': 1, 'bauforschung': 1, ..."
<http://dbpedia.org/resource/G-Enka> ...,G-Enka,henry krvits born 30 december 1974 in tallinn ...,"{'they': 1, 'gangstergenka': 1, ..."
<http://dbpedia.org/resource/Sam_Henderson> ...,Sam Henderson,sam henderson born october 18 1969 is an ...,"{'currently': 1, 'less': 1, 'being': 1, ..."
<http://dbpedia.org/resource/Aaron_LaCrate> ...,Aaron LaCrate,aaron lacrate is an american music producer ...,"{'exclusive': 2, 'producer': 1, 'show' ..."
<http://dbpedia.org/resource/Trevor_Ferguson> ...,Trevor Ferguson,trevor ferguson aka john farrow born 11 november ...,"{'taxi': 1, 'salon': 1, 'gangs': 1, 'being': 1, ..."
<http://dbpedia.org/resource/Grant_Nelson> ...,Grant Nelson,grant nelson born 27 april 1971 in london ...,"{'houston': 1, 'frankie': 1, 'labels': 1, ..."
<http://dbpedia.org/resource/Cathy_Caruth> ...,Cathy Caruth,cathy caruth born 1955 is frank h t rhodes ...,"{'phenomenon': 1, 'deborash': 1, 'both' ..."


To compute TF-IDF we use the method **text_analytics.tf_idf()** and pass the 'word_count' column as its argument:<br>

1. Compute the term frequency (TF): how often each word occurs in a document;<br>
2. Optionally normalize the vector of word counts (the scores printed below appear to be based on raw counts);<br>
3. Compute the inverse document frequency (IDF);<br>
4. **Multiply TF by IDF** to get the TF-IDF score. <br>

We add one column with the TF-IDF results to the people SFrame.
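
The steps above can be sketched in plain Python. This is a minimal sketch under an assumption: it uses the common definition TF-IDF(w, d) = count(w, d) * log(N / df(w)), where N is the number of documents and df(w) is the number of documents containing w. The exact formula GraphLab uses may differ, and the three-document `corpus` is a made-up toy example.

```python
import math


def tf_idf(word_counts):
    """word_counts: list of dicts (word -> count), one dict per document."""
    n_docs = len(word_counts)

    # Document frequency: in how many documents does each word appear?
    doc_freq = {}
    for counts in word_counts:
        for word in counts:
            doc_freq[word] = doc_freq.get(word, 0) + 1

    # Raw count times log(N / df); a word found in every document scores 0.
    return [
        {word: count * math.log(float(n_docs) / doc_freq[word])
         for word, count in counts.items()}
        for counts in word_counts
    ]


# Toy corpus (hypothetical counts, for illustration only).
corpus = [
    {"the": 3, "obama": 2, "senate": 1},
    {"the": 2, "beckham": 2, "football": 1},
    {"the": 4, "senate": 1, "law": 1},
]

scores = tf_idf(corpus)
print(scores[0])
```

In this toy corpus "the" appears in all three documents, so its score is zero everywhere, while "obama" appears in only one document and gets the highest score there. This is why the document frequencies must come from the whole corpus rather than from a single article.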

In [37]:
people['tfidf'] = graphlab.text_analytics.tf_idf(people['word_count'])

In [38]:
people.head()

URI,name,text,word_count
<http://dbpedia.org/resource/Digby_Morrell> ...,Digby Morrell,digby morrell born 10 october 1979 is a former ...,"{'selection': 1, 'carltons': 1, 'being': ..."
<http://dbpedia.org/resource/Alfred_J._Lewy> ...,Alfred J. Lewy,alfred j lewy aka sandy lewy graduated from ...,"{'precise': 1, 'thomas': 1, 'closely': 1, ..."
<http://dbpedia.org/resource/Harpdog_Brown> ...,Harpdog Brown,harpdog brown is a singer and harmonica player who ...,"{'just': 1, 'issued': 1, 'mainly': 1, 'nominat ..."
<http://dbpedia.org/resource/Franz_Rottensteiner> ...,Franz Rottensteiner,franz rottensteiner born in waidmannsfeld lower ...,"{'all': 1, 'bauforschung': 1, ..."
<http://dbpedia.org/resource/G-Enka> ...,G-Enka,henry krvits born 30 december 1974 in tallinn ...,"{'they': 1, 'gangstergenka': 1, ..."
<http://dbpedia.org/resource/Sam_Henderson> ...,Sam Henderson,sam henderson born october 18 1969 is an ...,"{'currently': 1, 'less': 1, 'being': 1, ..."
<http://dbpedia.org/resource/Aaron_LaCrate> ...,Aaron LaCrate,aaron lacrate is an american music producer ...,"{'exclusive': 2, 'producer': 1, 'show' ..."
<http://dbpedia.org/resource/Trevor_Ferguson> ...,Trevor Ferguson,trevor ferguson aka john farrow born 11 november ...,"{'taxi': 1, 'salon': 1, 'gangs': 1, 'being': 1, ..."
<http://dbpedia.org/resource/Grant_Nelson> ...,Grant Nelson,grant nelson born 27 april 1971 in london ...,"{'houston': 1, 'frankie': 1, 'labels': 1, ..."
<http://dbpedia.org/resource/Cathy_Caruth> ...,Cathy Caruth,cathy caruth born 1955 is frank h t rhodes ...,"{'phenomenon': 1, 'deborash': 1, 'both' ..."

docs,tfidf
"{'selection': 3.836578553093086, ...","{'selection': 3.836578553093086, ..."
"{'precise': 6.44320060695519, ...","{'precise': 6.44320060695519, ..."
"{'just': 2.7007299687108643, ...","{'just': 2.7007299687108643, ..."
"{'all': 1.6431112434912472, ...","{'all': 1.6431112434912472, ..."
"{'they': 1.8993401178193898, ...","{'they': 1.8993401178193898, ..."
"{'currently': 1.637088969126014, ...","{'currently': 1.637088969126014, ..."
"{'exclusive': 10.455187230695827, ...","{'exclusive': 10.455187230695827, ..."
"{'taxi': 6.0520214560945025, ...","{'taxi': 6.0520214560945025, ..."
"{'houston': 3.935505942157149, ...","{'houston': 3.935505942157149, ..."
"{'phenomenon': 5.750053426395245, ...","{'phenomenon': 5.750053426395245, ..."


## Examine the TF-IDF for the Obama article

Now we take the Obama article, stack the 'tfidf' column into two columns, 'word' and 'tfidf', and then sort it.

In [41]:
obama = people[people['name'] == 'Barack Obama']

In [42]:
obama[['tfidf']].stack('tfidf',new_column_name=['word','tfidf']).sort('tfidf',ascending=False)

word,tfidf
obama,43.2956530721
act,27.678222623
iraq,17.747378588
control,14.8870608452
law,14.7229357618
ordered,14.5333739509
military,13.1159327785
involvement,12.7843852412
response,12.7843852412
democratic,12.4106886973


Words with the highest TF-IDF are much more informative: they are the key terms when computing similarity between documents.

# Manually compute distances between a few people

Let's manually compare the distances between the articles for a few famous people.  

We take three people (Obama, Clinton, and Beckham) and compute the similarity between their articles. 

In [49]:
clinton = people[people['name'] == 'Bill Clinton']

In [51]:
clinton.head()

URI,name,text,word_count
<http://dbpedia.org/resource/Bill_Clinton> ...,Bill Clinton,william jefferson bill clinton born william ...,"{'rating': 1, 'serving': 1, 'surplus': 1, ..."

docs,tfidf
"{'rating': 5.377023594040234, ...","{'rating': 5.377023594040234, ..."


In [50]:
beckham = people[people['name'] == 'David Beckham']

## Is Obama closer to Clinton than to Beckham?

We will use cosine distance, which is given by

(1-cosine_similarity) 

and find that the article about President Obama is closer to the one about former president Clinton than to the one about footballer David Beckham.

Cosine similarity: https://janav.wordpress.com/2013/10/27/tf-idf-and-cosine-similarity/

From each document we derive a vector. If you need a refresher on vectors, see https://www.mathsisfun.com/algebra/vectors-dot-product.html. <br>
The set of documents in a collection is then viewed as a set of vectors in a vector space, where each term has its own axis.<br> Using the formulas below, we can compute the similarity between any two documents.

$$\text{Cosine Similarity}(d_1, d_2) = \frac{d_1 \cdot d_2}{\lVert d_1 \rVert \, \lVert d_2 \rVert}$$

$$d_1 \cdot d_2 = d_1[0]\, d_2[0] + d_1[1]\, d_2[1] + \dots + d_1[n]\, d_2[n]$$

$$\lVert d_1 \rVert = \sqrt{d_1[0]^2 + d_1[1]^2 + \dots + d_1[n]^2}$$

$$\lVert d_2 \rVert = \sqrt{d_2[0]^2 + d_2[1]^2 + \dots + d_2[n]^2}$$

1) For each document, multiply the TF and IDF of every word to get its TF-IDF vector (one vector each for the Obama, Clinton, and Beckham articles).<br>
2) Compute the norm of each vector: the square root of the sum of its squared entries.<br>
3) Compute the cosine similarity with the formula Dot product(d1, d2) / (||d1|| * ||d2||).
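
As a minimal sketch of these formulas, the following plain-Python functions compute the cosine distance between two sparse vectors stored as dictionaries (word -> weight), the same shape as the 'tfidf' column. The function names and the small example vectors are made up for illustration; this is not GraphLab's implementation, and it assumes neither vector is all zeros.

```python
import math


def dot(d1, d2):
    """Dot product over the words the two sparse vectors share."""
    return sum(value * d2[word] for word, value in d1.items() if word in d2)


def norm(d):
    """Euclidean length: square root of the sum of squared entries."""
    return math.sqrt(sum(value * value for value in d.values()))


def cosine_distance(d1, d2):
    """1 - cosine similarity; assumes both vectors have nonzero length."""
    return 1.0 - dot(d1, d2) / (norm(d1) * norm(d2))


# Hypothetical TF-IDF vectors.
a = {"president": 3.0, "senate": 4.0}
b = {"president": 4.0, "senate": 3.0}
c = {"football": 5.0}

print(cosine_distance(a, a))  # identical direction
print(cosine_distance(a, b))  # overlapping vocabulary
print(cosine_distance(a, c))  # no shared words
```

Words missing from a dictionary are treated as having weight zero, so two articles with no words in common are at distance 1, the largest value possible when all weights are non-negative.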

We're going to use cosine distance. As a side note, you may be more familiar with cosine similarity,<br>
where the higher the number, the more similar two articles are.<br>
Here we use the distance version of this number, so lower is better. <br>
**The lower the cosine distance, the closer the articles are.**<br>
So the question is, **what is the cosine distance between Obama's tfidf and that of Clinton?**<br>
Notice that after selecting the tfidf column we have to add the index 0, because we want the zeroth row of the table. The table has only **one element in it**, but we still have to **say which row of the table we're looking at**. We then compare the Obama tfidf with the Clinton tfidf. 

To be specific: we take the first row (the only one in the Clinton, Obama, or Beckham table) by writing obama['tfidf'][0].

In [52]:
graphlab.distances.cosine(obama['tfidf'][0],clinton['tfidf'][0])

0.8339854936884276

In [53]:
graphlab.distances.cosine(obama['tfidf'][0],beckham['tfidf'][0])

0.9791305844747478

So in this case, Obama is much closer to Clinton than he is to Beckham, which makes a lot of sense.<br>
But we've done this manually for just a few people. How do we automate the process of finding how close an article is to the other articles, and in this case how close a person is to other people? 


# Build a nearest neighbor model for document retrieval

We now create a nearest-neighbors model and apply it to document retrieval.  

We use the nearest_neighbors model, passing 'tfidf' as the features and 'name' as the label, to find similar articles by name based on their TF-IDF vectors.
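
Conceptually, a nearest-neighbors query can be sketched as a brute-force search: compute the distance from the query to every reference document and keep the smallest ones. The sketch below assumes cosine distance and uses made-up labels and vectors; GraphLab's model may choose a different distance and a faster search method, so this illustrates the idea rather than the library's behavior.

```python
import math


def cosine_distance(d1, d2):
    """Cosine distance between two sparse vectors stored as dicts."""
    dot = sum(value * d2[word] for word, value in d1.items() if word in d2)
    norm1 = math.sqrt(sum(value * value for value in d1.values()))
    norm2 = math.sqrt(sum(value * value for value in d2.values()))
    return 1.0 - dot / (norm1 * norm2)


def query(reference, target, k=5):
    """Return the k (label, distance) pairs closest to `target`."""
    ranked = sorted(
        (cosine_distance(target, vector), label)
        for label, vector in reference.items()
    )
    return [(label, distance) for distance, label in ranked[:k]]


# Hypothetical reference set: label -> TF-IDF-like vector.
reference = {
    "doc_a": {"senate": 2.0, "law": 1.0},
    "doc_b": {"senate": 1.0, "law": 1.0, "president": 1.0},
    "doc_c": {"football": 3.0, "club": 1.0},
}

result = query(reference, reference["doc_a"], k=2)
for label, distance in result:
    print(label, distance)
```

The query document is its own nearest neighbor. Its distance to itself may print as a tiny nonzero number instead of exactly 0 because of floating-point rounding.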

In [54]:
knn_model = graphlab.nearest_neighbors.create(people,features=['tfidf'],label='name')

# Applying the nearest-neighbors model for retrieval

## Who is closest to Obama?

We look up the articles closest to Barack Obama's by distance: the lower the distance, the better. 

In [55]:
knn_model.query(obama)

query_label,reference_label,distance,rank
0,Barack Obama,0.0,1
0,Joe Biden,0.794117647059,2
0,Joe Lieberman,0.794685990338,3
0,Kelly Ayotte,0.811989100817,4
0,Bill Clinton,0.813852813853,5


The closest article to Obama's is Obama's own, which makes sense. After that comes Joe Biden, who was Obama's vice president. <br>
So the president is closest to the vice president. Then come a few other politicians, including former president Bill Clinton, whom we compared manually above. <br>
Note that the distance to Clinton here (0.8139) differs from the manual cosine distance (0.8340): `knn_model` was created without a `distance` argument, so it evidently uses an automatically chosen default rather than cosine distance.

## Other examples of document retrieval

In [56]:
swift = people[people['name'] == 'Taylor Swift']

In [57]:
knn_model.query(swift)

query_label,reference_label,distance,rank
0,Taylor Swift,0.0,1
0,Carrie Underwood,0.76231884058,2
0,Alicia Keys,0.764705882353,3
0,Jordin Sparks,0.769633507853,4
0,Leona Lewis,0.776119402985,5


In other words, according to the model, who are the people closest to Taylor Swift?<br>
She is a female singer, and the results are other female singers of roughly the same generation: Carrie Underwood, Alicia Keys, Jordin Sparks, and Leona Lewis. 

In [58]:
jolie = people[people['name'] == 'Angelina Jolie']

In [59]:
knn_model.query(jolie)

query_label,reference_label,distance,rank
0,Angelina Jolie,0.0,1
0,Brad Pitt,0.784023668639,2
0,Julianne Moore,0.795857988166,3
0,Billy Bob Thornton,0.803069053708,4
0,George Clooney,0.8046875,5


In [60]:
arnold = people[people['name'] == 'Arnold Schwarzenegger']

In [61]:
knn_model.query(arnold)

query_label,reference_label,distance,rank
0,Arnold Schwarzenegger,0.0,1
0,Jesse Ventura,0.818918918919,2
0,John Kitzhaber,0.824615384615,3
0,Lincoln Chafee,0.833876221498,4
0,Anthony Foxx,0.833910034602,5


#### Later, let's do this for Last.fm to find the closest singers 

### Question 1

In [62]:
elton = people[people['name'] == 'Elton John']

What are the 3 words in his article with the highest word counts?

In [63]:
elton['word_count']

dtype: dict
Rows: ?
[{'all': 1, 'least': 1, 'producer': 1, 'heavily': 1, 'inducted': 1, 'john': 7, 'over': 2, 'named': 1, 'making': 1, 'years': 1, 'four': 1, 'openly': 1, 'including': 1, 'highestprofile': 1, 'its': 2, 'impact': 1, '1': 2, '27': 1, '21': 2, 'wed': 1, 'datein': 1, 'royal': 1, '1947': 1, 'abbey': 1, 'winning': 1, 'late': 1, 'to': 4, 'taupin': 1, 'born': 1, '2014': 1, 'as': 2, 'has': 9, '2013': 1, 'his': 4, 'march': 1, '10': 1, 'songwriter': 2, 'solo': 1, 'continues': 1, 'records': 1, 'five': 1, 'occasional': 1, 'they': 1, 'inception': 1, 'world': 1, 'one': 3, 'hall': 2, 'bestselling': 2, 'fivedecade': 1, 'knighthood': 1, '58': 1, 'artist': 1, 'roll': 2, 'inductee': 1, 'list': 1, 'events': 1, 'hercules': 1, 'announced': 1, 'rock': 2, 'alltime': 1, 'brit': 1, 'bernie': 1, 'england': 1, 'concert': 1, 'be': 1, 'diana': 1, 'globe': 1, 'artists': 2, 'him': 3, 'culture': 1, 'year': 1, 'billboard': 4, 'aids': 2, 'empire': 1, 'honors': 1, 'composers': 1, 'established': 1, 'elton':

In [64]:
elton_word_counts = elton[['word_count']].stack('word_count', new_column_name=['word', 'count'])

In [65]:
elton_word_counts.sort('count',ascending=False)

word,count
the,27
in,18
and,15
of,13
a,10
has,9
john,7
he,7
on,6
award,5


The words **'the', 'in', 'and'** are the most frequent.

In terms of TF-IDF what are the most valuable words?

In [67]:
elton[['tfidf']].stack('tfidf',new_column_name=['word','tfidf']).sort('tfidf',ascending=False)

word,tfidf
furnish,18.38947184
elton,17.48232027
billboard,17.3036809575
john,13.9393127924
songwriters,11.250406447
tonightcandle,10.9864953892
overallelton,10.9864953892
19702000,10.2933482087
fivedecade,10.2933482087
aids,10.262846934


The words **'furnish', 'elton', 'billboard'** have the highest TF-IDF scores.

### Question 2

In [68]:
victoria_beckham = people[people['name'] == 'Victoria Beckham']

In [69]:
graphlab.distances.cosine(elton['tfidf'][0], victoria_beckham['tfidf'][0])

0.9567006376655429

A cosine distance of about 0.957 is close to 1, so the articles on Victoria Beckham and Elton John have very little in common.

In [70]:
mccartney = people[people['name'] == 'Paul McCartney']

In [71]:
graphlab.distances.cosine(elton['tfidf'][0], mccartney['tfidf'][0])

0.8250310029221779

The cosine distance between Elton John and Paul McCartney (0.825) is smaller than the distance between Elton John and Victoria Beckham (0.957), so Elton John's article is more similar to McCartney's (both are musicians).

In [72]:
knn_model_distance = graphlab.nearest_neighbors.create(people, features=['word_count'],label='name', distance='cosine')

In [73]:
knn_model_distance.query(elton)

query_label,reference_label,distance,rank
0,Elton John,2.22044604925e-16,1
0,Cliff Richard,0.16142415259,2
0,Sandro Petrone,0.16822542751,3
0,Rod Stewart,0.168327165587,4
0,Malachi O'Doherty,0.177315545979,5


What’s the most similar article, other than itself, to the one on ‘Elton John’ using word count features? <br>

**Cliff Richard**

What’s the most similar article, other than itself, to the one on ‘Elton John’ using TF-IDF features?

In [74]:
knn_model.query(elton)

query_label,reference_label,distance,rank
0,Elton John,0.0,1
0,Phil Collins,0.76399026764,2
0,Rod Stewart,0.773333333333,3
0,Annie Lennox,0.776623376623,4
0,Barry Gibb,0.780952380952,5


**Phil Collins** (according to `knn_model`, which was created without an explicit `distance`)

What’s the most similar article, other than itself, to the one on ‘Victoria Beckham’ using word count features?<br>
What’s the most similar article, other than itself, to the one on ‘Victoria Beckham’ using TF-IDF features?

In [75]:
knn_model_distance.query(victoria_beckham)

query_label,reference_label,distance,rank
0,Victoria Beckham,-2.22044604925e-16,1
0,Mary Fitzgerald (artist),0.207307036115,2
0,Adrienne Corri,0.214509782788,3
0,Beverly Jane Fry,0.217466468741,4
0,Raman Mundair,0.217695474992,5


**Mary Fitzgerald (artist)**

In [76]:
knn_model.query(victoria_beckham)

query_label,reference_label,distance,rank
0,Victoria Beckham,0.0,1
0,Cheryl Cole,0.800586510264,2
0,Heidi Klum,0.810344827586,3
0,Simon Fuller,0.822742474916,4
0,Adele,0.824915824916,5


**Cheryl Cole** (according to `knn_model`, which was created without an explicit `distance`)

In [77]:
knn_model_distance_new = graphlab.nearest_neighbors.create(people, features=['tfidf'],label='name', distance='cosine')

In [78]:
knn_model_distance_new.query(elton)

query_label,reference_label,distance,rank
0,Elton John,-2.22044604925e-16,1
0,Rod Stewart,0.717219667893,2
0,George Michael,0.747600998969,3
0,Sting (musician),0.747671954431,4
0,Phil Collins,0.75119324879,5


In [79]:
knn_model_distance_new.query(victoria_beckham)

query_label,reference_label,distance,rank
0,Victoria Beckham,1.11022302463e-16,1
0,David Beckham,0.548169610263,2
0,Stephen Dow Beckham,0.784986706828,3
0,Mel B,0.809585523409,4
0,Caroline Rush,0.819826422919,5
