# First get data

In [18]:
import graphlab

In [19]:
people = graphlab.SFrame('people_wiki.gl/')

In [20]:
len(people)

59071

# Now get word count and TFIDF for all people

In [21]:
people['word_count'] = graphlab.text_analytics.count_words(people['text'])

In [22]:
people['tfidf'] = graphlab.text_analytics.tf_idf(people['word_count'])
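
To make the two representations concrete, here is a minimal pure-Python sketch of what a word-count dictionary and a TF-IDF dictionary look like for a tiny corpus. It assumes the common weighting tf × log(N / df) and simple whitespace tokenization; GraphLab's own tokenization and exact formula may differ, and the helper names `count_words` and `tf_idf` below are illustrative stand-ins, not the library functions.

```python
import math
from collections import Counter

def count_words(text):
    """Return a dict mapping each whitespace-separated token to its count."""
    return dict(Counter(text.lower().split()))

def tf_idf(word_counts):
    """Weight each document's counts by log(N / document frequency).

    Assumption: tf * log(N / df); other TF-IDF variants exist.
    """
    n_docs = float(len(word_counts))  # float so division also works on Python 2
    doc_freq = Counter()
    for counts in word_counts:
        doc_freq.update(counts.keys())  # each document counts a word once
    return [
        {word: tf * math.log(n_docs / doc_freq[word]) for word, tf in counts.items()}
        for counts in word_counts
    ]

# Hypothetical toy corpus, not taken from people_wiki.gl
docs = [
    "the singer wrote the song",
    "the striker scored the goal",
    "the singer toured",
]
counts = [count_words(d) for d in docs]
weights = tf_idf(counts)
print(counts[0])
print(weights[0])
```

In this toy corpus "the" appears in every document, so its weight is zero even though its raw count is the highest, while a word unique to one document ("wrote") gets the largest weight.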

In [23]:
people.head(2)

URI,name,text,word_count
<http://dbpedia.org/resource/Digby_Morrell> ...,Digby Morrell,digby morrell born 10 october 1979 is a former ...,"{'selection': 1, 'carltons': 1, 'being': ..."
<http://dbpedia.org/resource/Alfred_J._Lewy> ...,Alfred J. Lewy,alfred j lewy aka sandy lewy graduated from ...,"{'precise': 1, 'thomas': 1, 'closely': 1, ..."

tfidf
"{'selection': 3.836578553093086, ..."
"{'precise': 6.44320060695519, ..."


# Task 1) Compare top words according to word counts to TF-IDF: 
In the notebook we covered in the module, we explored two document representations: word counts and TF-IDF. Now, take a particular famous person, 'Elton John'.
* What are the 3 words in his article with the highest word counts?
* What are the 3 words in his article with the highest TF-IDF?

These results illustrate why TF-IDF is useful for finding important words. 

Save these results to answer the quiz at the end.

In [24]:
elton = people[people['name'] == 'Elton John']

In [25]:
elton

URI,name,text,word_count
<http://dbpedia.org/resource/Elton_John> ...,Elton John,sir elton hercules john cbe born reginald ken ...,"{'all': 1, 'least': 1, 'producer': 1, 'heavi ..."

tfidf
"{'all': 1.6431112434912472, ..."


In [26]:
elton['word_count'] = graphlab.text_analytics.count_words(elton['text'])

In [27]:
elton_word_count_table = elton[['word_count']].stack('word_count', new_column_name=['word','count']).sort('count',ascending=False)

## Elton top 3 words

In [28]:
elton_word_count_table.head(3)

word,count
the,27
in,18
and,15


## Elton words with top TFIDF

In [29]:
elton[['tfidf']].stack('tfidf', new_column_name=['word','tfidf']).sort('tfidf', ascending=False)

word,tfidf
furnish,18.38947184
elton,17.48232027
billboard,17.3036809575
john,13.9393127924
songwriters,11.250406447
tonightcandle,10.9864953892
overallelton,10.9864953892
19702000,10.2933482087
fivedecade,10.2933482087
aids,10.262846934


# Task 2) Measuring distance: 

Elton John is a famous singer; let’s compute the distance between his article and those of two other famous singers. In this assignment, you will use the cosine distance, which is one measure of how dissimilar two vectors are (a smaller distance means more similar articles), similar to the one discussed in the lectures. You can compute this distance using the `graphlab.distances.cosine` function.

* What’s the cosine distance between the articles on ‘Elton John’ and ‘Victoria Beckham’? 
* What’s the cosine distance between the articles on ‘Elton John’ and ‘Paul McCartney’?
* Which one of the two is closer to Elton John?
* Does this result make sense to you? 

Save these results to answer the quiz at the end.
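
As a rough illustration of what the cosine distance measures, here is a minimal sketch for sparse vectors stored as dictionaries, defined as 1 minus the cosine similarity. The function name `cosine_distance` and the toy vectors are assumptions for illustration; this is not GraphLab's implementation.

```python
import math

def cosine_distance(a, b):
    """Return 1 - cos(angle) between two sparse vectors given as {word: weight}."""
    # Only words present in both vectors contribute to the dot product.
    dot = sum(weight * b[word] for word, weight in a.items() if word in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine distance is undefined for an all-zero vector")
    return 1.0 - dot / (norm_a * norm_b)

# Hypothetical toy vectors
a = {'piano': 2.0, 'song': 1.0}
b = {'song': 1.0, 'football': 2.0}
c = {'piano': 4.0, 'song': 2.0}  # same direction as a, twice the length

print(cosine_distance(a, b))  # about 0.8: only 'song' is shared
print(cosine_distance(a, c))  # about 0.0: scaling a vector does not change the angle
```

Because the measure depends only on the angle between vectors, a long article and a short article with the same word proportions end up at distance close to zero.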

## What’s the cosine distance between the articles on ‘Elton John’ and ‘Victoria Beckham’?

In [30]:
victoria_beckham = people[people['name'] == 'Victoria Beckham']

In [31]:
len(victoria_beckham)

1

In [32]:
graphlab.distances.cosine(elton['tfidf'][0],victoria_beckham['tfidf'][0])

0.9567006376655429

## What’s the cosine distance between the articles on ‘Elton John’ and ‘Paul McCartney’?

In [33]:
mccartney = people[people['name'] == 'Paul McCartney']

In [34]:
graphlab.distances.cosine(elton['tfidf'][0],mccartney['tfidf'][0])

0.8250310029221779

## Which one of the two is closer to Elton John?

Paul McCartney: his cosine distance to Elton John is about 0.825, versus about 0.957 for Victoria Beckham.

## Does this result make sense to you?

Yes. Both are British musicians from the same era, so it is plausible that their articles share more distinctive vocabulary.

# Task 3) Building nearest neighbors models with different input features and setting the distance metric: 

In the sample notebook, we built a nearest neighbors model for retrieving articles using TF-IDF as features and using the default setting in the construction of the nearest neighbors model. Now, you will build two nearest neighbors models:

* Using word counts as features
* Using TF-IDF as features

In both of these models, we are going to set the distance function to cosine distance. Here is how: when you call the function

`graphlab.nearest_neighbors.create`

add the parameter:

`distance='cosine'`

Now we are ready to use our model to retrieve documents. Use these two models to collect the following results:

* What’s the most similar article, other than itself, to the one on ‘Elton John’ using word count features?
* What’s the most similar article, other than itself, to the one on ‘Elton John’ using TF-IDF features?
* What’s the most similar article, other than itself, to the one on ‘Victoria Beckham’ using word count features?
* What’s the most similar article, other than itself, to the one on ‘Victoria Beckham’ using TF-IDF features?

Save these results to answer the quiz at the end.
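
Conceptually, a nearest neighbors query with cosine distance can be sketched as a brute-force scan: compute the distance from the query to every labelled vector and sort. The code below is only an illustration under that assumption; GraphLab may use a different search strategy internally, and the names `nearest_neighbors` and the toy corpus labels are hypothetical.

```python
import math

def cosine_distance(a, b):
    """Return 1 - cosine similarity for sparse vectors given as {word: weight}."""
    dot = sum(weight * b[word] for word, weight in a.items() if word in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return 1.0 - dot / (norm_a * norm_b)

def nearest_neighbors(query, corpus, k=5):
    """Rank every labelled vector in corpus by cosine distance to query."""
    scored = sorted((cosine_distance(query, vec), label) for label, vec in corpus.items())
    return [(label, dist) for dist, label in scored[:k]]

# Hypothetical toy corpus: label -> feature dictionary
corpus = {
    'singer_a': {'piano': 3.0, 'album': 2.0},
    'singer_b': {'piano': 1.0, 'album': 2.0, 'tour': 1.0},
    'footballer': {'goal': 3.0, 'club': 2.0},
}

results = nearest_neighbors(corpus['singer_a'], corpus, k=3)
# Rank 1 is the query itself, with distance zero up to floating-point round-off,
# so the most similar *other* article is the second entry.
closest_other = results[1]
print(results)
print(closest_other)
```

Swapping the feature dictionaries (word counts versus TF-IDF) changes the ranking without changing the search itself, which is the comparison this task asks for.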

In [46]:
tfidf_knn_model = graphlab.nearest_neighbors.create(people, features=['tfidf'],label='name',distance='cosine')

In [47]:
word_count_knn_model = graphlab.nearest_neighbors.create(people, features=['word_count'],label='name',distance='cosine')

## What’s the most similar article, other than itself, to the one on ‘Elton John’ using word count features?

### Answer) Cliff Richard

In [48]:
word_count_knn_model.query(elton)

query_label,reference_label,distance,rank
0,Elton John,2.22044604925e-16,1
0,Cliff Richard,0.16142415259,2
0,Sandro Petrone,0.16822542751,3
0,Rod Stewart,0.168327165587,4
0,Malachi O'Doherty,0.177315545979,5


## What’s the most similar article, other than itself, to the one on ‘Elton John’ using TF-IDF features?

### Answer) Rod Stewart

In [49]:
tfidf_knn_model.query(elton)

query_label,reference_label,distance,rank
0,Elton John,-2.22044604925e-16,1
0,Rod Stewart,0.717219667893,2
0,George Michael,0.747600998969,3
0,Sting (musician),0.747671954431,4
0,Phil Collins,0.75119324879,5


## What’s the most similar article, other than itself, to the one on ‘Victoria Beckham’ using word count features?

### Answer) Mary Fitzgerald (artist)

In [50]:
word_count_knn_model.query(victoria_beckham)

query_label,reference_label,distance,rank
0,Victoria Beckham,-2.22044604925e-16,1
0,Mary Fitzgerald (artist),0.207307036115,2
0,Adrienne Corri,0.214509782788,3
0,Beverly Jane Fry,0.217466468741,4
0,Raman Mundair,0.217695474992,5


## What’s the most similar article, other than itself, to the one on ‘Victoria Beckham’ using TF-IDF features?

### Answer) David Beckham

In [51]:
tfidf_knn_model.query(victoria_beckham)

query_label,reference_label,distance,rank
0,Victoria Beckham,1.11022302463e-16,1
0,David Beckham,0.548169610263,2
0,Stephen Dow Beckham,0.784986706828,3
0,Mel B,0.809585523409,4
0,Caroline Rush,0.819826422919,5
