## Retrieving Wikipedia articles

In this module, we focused on using nearest neighbors and clustering to retrieve documents that interest users, by analyzing their text. We explored two document representations: word counts and TF-IDF. We also built an iPython notebook for retrieving articles from Wikipedia about famous people.

In this assignment, we are going to dig deeper into this application, explore the retrieval results for various famous people, and familiarize ourselves with the code needed to build a retrieval system. These techniques will be key to building the intelligent application in your capstone project.

### What you will do
Now you are ready! We are going to do three tasks in this assignment. There are several results you need to gather along the way to enter into the quiz after this reading.

### Compare top words according to word counts to TF-IDF: 
In the notebook we covered in the module, we explored two document representations: word counts and TF-IDF. Now, take a particular famous person, 'Elton John'. What are the 3 words in his articles with highest word counts? What are the 3 words in his articles with highest TF-IDF? These results illustrate why TF-IDF is useful for finding important words. **Save these results to answer the quiz at the end.**

In [1]:
import graphlab
people = graphlab.SFrame('../week4/people_wiki.gl/')
people.head(3)


URI,name,text
<http://dbpedia.org/resource/Digby_Morrell> ...,Digby Morrell,digby morrell born 10 october 1979 is a former ...
<http://dbpedia.org/resource/Alfred_J._Lewy> ...,Alfred J. Lewy,alfred j lewy aka sandy lewy graduated from ...
<http://dbpedia.org/resource/Harpdog_Brown> ...,Harpdog Brown,harpdog brown is a singer and harmonica player who ...


In [2]:
# add a word_count column to the people table
people['word_count'] = graphlab.text_analytics.count_words(people['text'])

# add a tfidf column to the people table, computed from the word counts
people['tfidf'] = graphlab.text_analytics.tf_idf(people['word_count'])
people.head(3)

URI,name,text,word_count
<http://dbpedia.org/resource/Digby_Morrell> ...,Digby Morrell,digby morrell born 10 october 1979 is a former ...,"{'selection': 1, 'carltons': 1, 'being': ..."
<http://dbpedia.org/resource/Alfred_J._Lewy> ...,Alfred J. Lewy,alfred j lewy aka sandy lewy graduated from ...,"{'precise': 1, 'thomas': 1, 'closely': 1, ..."
<http://dbpedia.org/resource/Harpdog_Brown> ...,Harpdog Brown,harpdog brown is a singer and harmonica player who ...,"{'just': 1, 'issued': 1, 'mainly': 1, 'nominat ..."

tfidf
"{'selection': 3.836578553093086, ..."
"{'precise': 6.44320060695519, ..."
"{'just': 2.7007299687108643, ..."


In [3]:
elton = people[people['name'] == 'Elton John']

In [4]:
elton

URI,name,text,word_count
<http://dbpedia.org/resource/Elton_John> ...,Elton John,sir elton hercules john cbe born reginald ken ...,"{'all': 1, 'least': 1, 'producer': 1, 'heavi ..."

tfidf
"{'all': 1.6431112434912472, ..."


In [5]:
# the 3 words in his articles with highest word counts
elton_word_count_table = elton[['word_count']].stack('word_count', new_column_name=['word', 'count']).sort('count', ascending=False)
elton_word_count_table.head(3)

word,count
the,27
in,18
and,15


In [6]:
# the 3 words in his articles with highest TF-IDF
elton_tfidf_table = elton[['tfidf']].stack('tfidf', new_column_name=['word', 'tfidf']).sort('tfidf', ascending=False)
elton_tfidf_table.head(3)

word,tfidf
furnish,18.38947184
elton,17.48232027
billboard,17.3036809575
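
The contrast between the two tables above (stop words such as "the" top the raw counts, while distinctive words top the TF-IDF ranking) can be reproduced on a toy corpus. The following is a minimal sketch in plain Python, not GraphLab's implementation: it assumes the common weighting `count * log(N / df)`, and the exact formula inside `graphlab.text_analytics.tf_idf` may differ. The helper names `count_words` and `tf_idf` and the three sample sentences are hypothetical.

```python
import math
from collections import Counter


def count_words(text):
    """Map each whitespace-separated token to the number of times it occurs."""
    return dict(Counter(text.lower().split()))


def tf_idf(word_count_dicts):
    """Weight each count by log(N / df), where df is the number of documents
    that contain the word. This weighting is an assumption of the sketch."""
    n_docs = float(len(word_count_dicts))
    doc_freq = Counter()
    for counts in word_count_dicts:
        doc_freq.update(counts.keys())
    return [
        {word: count * math.log(n_docs / doc_freq[word])
         for word, count in counts.items()}
        for counts in word_count_dicts
    ]


# Hypothetical toy corpus: "the" appears in every document.
docs = [
    "the singer released the album",
    "the footballer joined the club",
    "the singer toured the country",
]
counts = [count_words(d) for d in docs]
weights = tf_idf(counts)

top_by_count = max(counts[0], key=counts[0].get)
top_by_tfidf = max(weights[0], key=weights[0].get)
print(top_by_count, top_by_tfidf)
```

Because "the" occurs in all three documents, its weight is `2 * log(3 / 3) = 0`, so it drops from first place under word counts to last place under TF-IDF, while words unique to one document rise to the top.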


### Measuring distance:
Elton John is a famous singer; let’s compute the distance between his article and those of two other famous singers. In this assignment, you will use the cosine distance, which is one measure of similarity between vectors (the smaller the distance, the more similar the articles), similar to the one discussed in the lectures. You can compute this distance using the `graphlab.distances.cosine` function. What’s the cosine distance between the articles on ‘Elton John’ and ‘Victoria Beckham’? What’s the cosine distance between the articles on ‘Elton John’ and ‘Paul McCartney’? Which of the two is closer to Elton John? Does this result make sense to you? **Save these results to answer the quiz at the end.**

In [7]:
# cosine distance between the articles on ‘Elton John’ and ‘Victoria Beckham’
elton = people[people['name'] == 'Elton John']
victoria = people[people['name'] == 'Victoria Beckham']

graphlab.distances.cosine(elton['tfidf'][0], victoria['tfidf'][0])

0.9567006376655429

In [8]:
# cosine distance between the articles on ‘Elton John’ and ‘Paul McCartney’
paul = people[people['name'] == 'Paul McCartney']

graphlab.distances.cosine(elton['tfidf'][0], paul['tfidf'][0])

0.8250310029221779
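
To make the numbers above easier to interpret, here is a minimal sketch of cosine distance on sparse vectors stored as dictionaries, using the standard definition `1 - (a · b) / (|a| |b|)`. It is an illustration in plain Python and is not taken from GraphLab's source; the function name `cosine_distance` and the sample vectors are hypothetical.

```python
import math


def cosine_distance(a, b):
    """Cosine distance between two sparse vectors given as {key: weight} dicts."""
    # Only keys present in both vectors contribute to the dot product.
    dot = sum(value * b[key] for key, value in a.items() if key in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for an all-zero vector")
    return 1.0 - dot / (norm_a * norm_b)


# Hypothetical TF-IDF-style vectors.
x = {"piano": 2.0, "album": 1.0}
y = {"piano": 4.0, "album": 2.0}   # same direction as x, twice the length
z = {"football": 3.0}              # shares no words with x

same_direction = cosine_distance(x, y)
no_overlap = cosine_distance(x, z)
print(same_direction, no_overlap)
```

Vectors pointing in the same direction have a distance of approximately 0 regardless of their length, and vectors with non-negative weights and no words in common have a distance of 1. On this scale, 0.825 for Paul McCartney indicates more shared vocabulary with Elton John than 0.957 for Victoria Beckham.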

### Building nearest neighbors models with different input features and setting the distance metric:
In the sample notebook, we built a nearest neighbors model for retrieving articles using TF-IDF as features and using the default setting in the construction of the nearest neighbors model. Now, you will build two nearest neighbors models:

- Using word counts as features
- Using TF-IDF as features

In both of these models, we are going to set the distance function to cosine distance. Here is how: when you call the function

`graphlab.nearest_neighbors.create`

add the parameter:

`distance='cosine'`


In [9]:
# build word_count_knn model
word_count_knn = graphlab.nearest_neighbors.create(people, features=['word_count'], label='name', distance='cosine')

In [10]:
# build tfidf_knn model
tfidf_knn = graphlab.nearest_neighbors.create(people, features=['tfidf'], label='name', distance='cosine')

Now we are ready to use our model to retrieve documents. Use these two models to collect the following results:

- What’s the most similar article, other than itself, to the one on ‘Elton John’ using word count features?
- What’s the most similar article, other than itself, to the one on ‘Elton John’ using TF-IDF features?
- What’s the most similar article, other than itself, to the one on ‘Victoria Beckham’ using word count features?
- What’s the most similar article, other than itself, to the one on ‘Victoria Beckham’ using TF-IDF features?

**Save these results to answer the quiz at the end.**



In [11]:
# the most similar article, other than itself, to the one on ‘Elton John’ using word count features
word_count_knn.query(elton)

query_label,reference_label,distance,rank
0,Elton John,2.22044604925e-16,1
0,Cliff Richard,0.16142415259,2
0,Sandro Petrone,0.16822542751,3
0,Rod Stewart,0.168327165587,4
0,Malachi O'Doherty,0.177315545979,5


In [12]:
# the most similar article, other than itself, to the one on ‘Elton John’ using TF-IDF features
tfidf_knn.query(elton)

query_label,reference_label,distance,rank
0,Elton John,-2.22044604925e-16,1
0,Rod Stewart,0.717219667893,2
0,George Michael,0.747600998969,3
0,Sting (musician),0.747671954431,4
0,Phil Collins,0.75119324879,5


In [13]:
# the most similar article, other than itself, to the one on ‘Victoria Beckham’ using word count features
word_count_knn.query(victoria)

query_label,reference_label,distance,rank
0,Victoria Beckham,-2.22044604925e-16,1
0,Mary Fitzgerald (artist),0.207307036115,2
0,Adrienne Corri,0.214509782788,3
0,Beverly Jane Fry,0.217466468741,4
0,Raman Mundair,0.217695474992,5


In [14]:
# the most similar article, other than itself, to the one on ‘Victoria Beckham’ using TF-IDF features
tfidf_knn.query(victoria)

query_label,reference_label,distance,rank
0,Victoria Beckham,1.11022302463e-16,1
0,David Beckham,0.548169610263,2
0,Stephen Dow Beckham,0.784986706828,3
0,Mel B,0.809585523409,4
0,Caroline Rush,0.819826422919,5
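
Conceptually, each query above ranks every article by its cosine distance to the query article and returns the closest ones. The sketch below shows that idea as a brute-force search in plain Python. It is an assumption-laden illustration, not a description of how `graphlab.nearest_neighbors` is implemented internally; the names `cosine_distance` and `query` and the toy articles are hypothetical.

```python
import math


def cosine_distance(a, b):
    """Cosine distance between two sparse vectors given as {key: weight} dicts."""
    dot = sum(value * b[key] for key, value in a.items() if key in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return 1.0 - dot / (norm_a * norm_b)


def query(reference, query_vector, k=5):
    """Return the k (label, distance) pairs closest to query_vector, nearest first."""
    scored = [(label, cosine_distance(query_vector, vector))
              for label, vector in reference.items()]
    scored.sort(key=lambda pair: pair[1])
    return scored[:k]


# Hypothetical toy articles represented by non-empty feature dictionaries.
articles = {
    "singer_a": {"piano": 3.0, "album": 2.0, "tour": 1.0},
    "singer_b": {"album": 2.0, "tour": 2.0, "guitar": 1.0},
    "footballer": {"club": 3.0, "goal": 2.0, "tour": 1.0},
}

results = query(articles, articles["singer_a"], k=3)
for label, distance in results:
    print(label, distance)
```

As in the tables above, the query article itself comes back at rank 1 with a distance that is zero up to floating-point rounding, which is why values such as `2.22044604925e-16` and `-2.22044604925e-16` appear there; the answer to "most similar, other than itself" is the entry at rank 2.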
