# Document retrieval from Wikipedia data

## Fire up GraphLab Create

In [1]:
import graphlab

# Load some text data: Wikipedia pages on people

In [2]:
people = graphlab.SFrame('./data/people_wiki.gl/')

The data contains a link to the Wikipedia article, the name of the person, and the text of the article.

In [3]:
people.head()

URI,name,text
<http://dbpedia.org/resource/Digby_Morrell> ...,Digby Morrell,digby morrell born 10 october 1979 is a former ...
<http://dbpedia.org/resource/Alfred_J._Lewy> ...,Alfred J. Lewy,alfred j lewy aka sandy lewy graduated from ...
<http://dbpedia.org/resource/Harpdog_Brown> ...,Harpdog Brown,harpdog brown is a singer and harmonica player who ...
<http://dbpedia.org/resource/Franz_Rottensteiner> ...,Franz Rottensteiner,franz rottensteiner born in waidmannsfeld lower ...
<http://dbpedia.org/resource/G-Enka> ...,G-Enka,henry krvits born 30 december 1974 in tallinn ...
<http://dbpedia.org/resource/Sam_Henderson> ...,Sam Henderson,sam henderson born october 18 1969 is an ...
<http://dbpedia.org/resource/Aaron_LaCrate> ...,Aaron LaCrate,aaron lacrate is an american music producer ...
<http://dbpedia.org/resource/Trevor_Ferguson> ...,Trevor Ferguson,trevor ferguson aka john farrow born 11 november ...
<http://dbpedia.org/resource/Grant_Nelson> ...,Grant Nelson,grant nelson born 27 april 1971 in london ...
<http://dbpedia.org/resource/Cathy_Caruth> ...,Cathy Caruth,cathy caruth born 1955 is frank h t rhodes ...


In [4]:
len(people)

59071

# Explore the dataset and check out the text it contains

## Exploring the entry for President Obama

In [5]:
obama = people[people['name'] == 'Barack Obama']

In [6]:
obama

URI,name,text
<http://dbpedia.org/resource/Barack_Obama> ...,Barack Obama,barack hussein obama ii brk husen bm born august ...


In [7]:
obama['text']

dtype: str
Rows: ?
['barack hussein obama ii brk husen bm born august 4 1961 is the 44th and current president of the united states and the first african american to hold the office born in honolulu hawaii obama is a graduate of columbia university and harvard law school where he served as president of the harvard law review he was a community organizer in chicago before earning his law degree he worked as a civil rights attorney and taught constitutional law at the university of chicago law school from 1992 to 2004 he served three terms representing the 13th district in the illinois senate from 1997 to 2004 running unsuccessfully for the united states house of representatives in 2000in 2004 obama received national attention during his campaign to represent illinois in the united states senate with his victory in the march democratic party primary his keynote address at the democratic national convention in july and his election to the senate in november he began his presidential campa

## Exploring the entry for actor George Clooney

In [8]:
clooney = people[people['name'] == 'George Clooney']
clooney['text']

dtype: str
Rows: ?
['george timothy clooney born may 6 1961 is an american actor writer producer director and activist he has received three golden globe awards for his work as an actor and two academy awards one for acting and the other for producingclooney made his acting debut on television in 1978 and later gained wide recognition in his role as dr doug ross on the longrunning medical drama er from 1994 to 1999 for which he received two emmy award nominations while working on er he began attracting a variety of leading roles in films including the superhero film batman robin 1997 and the crime comedy out of sight 1998 in which he first worked with a director who would become a longtime collaborator steven soderbergh in 1999 clooney took the lead role in three kings a wellreceived war satire set during the gulf warin 2001 clooneys fame widened with the release of his biggest commercial success the heist comedy oceans eleven the first of the film trilogy a remake of the 1960 film wit

# Get the word counts for the Obama article

In [9]:
obama['word_count'] = graphlab.text_analytics.count_words(obama['text'])
obama['word_count']

dtype: dict
Rows: 1
[{'operations': 1L, 'represent': 1L, 'office': 2L, 'unemployment': 1L, 'doddfrank': 1L, 'over': 1L, 'unconstitutional': 1L, 'domestic': 2L, 'major': 1L, 'years': 1L, 'against': 1L, 'proposition': 1L, 'seats': 1L, 'graduate': 1L, 'debate': 1L, 'before': 1L, 'death': 1L, '20': 2L, 'taxpayer': 1L, 'representing': 1L, 'obamacare': 1L, 'barack': 1L, 'to': 14L, '4': 1L, 'policy': 2L, '8': 1L, 'he': 7L, '2011': 3L, '2010': 2L, '2013': 1L, '2012': 1L, 'bin': 1L, 'then': 1L, 'his': 11L, 'march': 1L, 'gains': 1L, 'cuba': 1L, 'school': 3L, '1992': 1L, 'new': 1L, 'not': 1L, 'during': 2L, 'ending': 1L, 'continued': 1L, 'presidential': 2L, 'states': 3L, 'husen': 1L, 'osama': 1L, 'californias': 1L, 'equality': 1L, 'prize': 1L, 'lost': 1L, 'made': 1L, 'inaugurated': 1L, 'january': 3L, 'university': 2L, 'rights': 1L, 'july': 1L, 'gun': 1L, 'stimulus': 1L, 'rodham': 1L, 'troop': 1L, 'withdrawal': 1L, 'brk': 1L, 'nine': 1L, 'where': 1L, 'referred': 1L, 'affordable': 1L, 'attorney': 1L

## Sort the word counts for the Obama article

### Turning the dictionary of word counts into a table

In [10]:
obama_word_count_table = obama[['word_count']].stack('word_count', new_column_name = ['word','count'])

### Sorting the word counts to show most common words at the top

In [11]:
obama_word_count_table.sort('count',ascending=False)

word,count
the,40
in,30
and,21
of,18
to,14
his,11
obama,9
act,8
a,7
he,7


The most common words are uninformative ones such as "the", "in", and "and".
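The counting and sorting steps can be sketched with the Python standard library alone. This is only an illustration, not GraphLab's implementation: the helper `count_words` below and the toy sentence are hypothetical, and the sketch assumes the text is already lowercased and stripped of punctuation, as the articles in this dataset appear to be.

```python
from collections import Counter


def count_words(text):
    """Count whitespace-separated tokens in a lowercased, punctuation-free string."""
    return Counter(text.split())


# Hypothetical toy text, not taken from the dataset.
sample_text = "the president of the united states signed the act"
counts = count_words(sample_text)

# most_common() orders words by descending count,
# similar in spirit to sort('count', ascending=False) above.
top_words = counts.most_common(3)
print(top_words)
```

Even in this tiny example the top entry is "the", which says nothing about the topic of the text.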

# Compute TF-IDF for the corpus 

To give more weight to informative words, we weight them by their TF-IDF scores.
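As a rough illustration of what a TF-IDF score captures, here is a minimal standard-library sketch. It assumes the common formulation tf * log(N / df), where N is the number of documents and df is the number of documents containing the word. The function `tf_idf` and the toy corpus are hypothetical, and GraphLab's own implementation may differ in detail.

```python
import math
from collections import Counter


def tf_idf(word_counts):
    """Compute TF-IDF weights for a list of {word: count} dicts.

    Assumed formulation: tf * log(N / df), with N the number of documents
    and df the number of documents that contain the word.
    """
    num_docs = len(word_counts)
    doc_freq = Counter()
    for counts in word_counts:
        # Each document contributes at most 1 to a word's document frequency.
        doc_freq.update(counts.keys())
    return [
        {
            word: tf * math.log(num_docs / float(doc_freq[word]))
            for word, tf in counts.items()
        }
        for counts in word_counts
    ]


# Hypothetical toy corpus, not taken from the dataset.
docs = [
    "the president signed the act",
    "the footballer scored the goal",
    "the singer released the album",
]
word_counts = [Counter(doc.split()) for doc in docs]
weights = tf_idf(word_counts)
print(weights[0])
```

Under this formulation a word that occurs in every document, such as "the" here, gets weight 0, while a word unique to one document keeps a positive weight.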

In [12]:
people['word_count'] = graphlab.text_analytics.count_words(people['text'])
people.head()

URI,name,text,word_count
<http://dbpedia.org/resource/Digby_Morrell> ...,Digby Morrell,digby morrell born 10 october 1979 is a former ...,"{'since': 1L, 'carltons': 1L, 'being': 1L, '2005': ..."
<http://dbpedia.org/resource/Alfred_J._Lewy> ...,Alfred J. Lewy,alfred j lewy aka sandy lewy graduated from ...,"{'precise': 1L, 'thomas': 1L, 'closely': 1L, ..."
<http://dbpedia.org/resource/Harpdog_Brown> ...,Harpdog Brown,harpdog brown is a singer and harmonica player who ...,"{'just': 1L, 'issued': 1L, 'mainly': 1L, ..."
<http://dbpedia.org/resource/Franz_Rottensteiner> ...,Franz Rottensteiner,franz rottensteiner born in waidmannsfeld lower ...,"{'all': 1L, 'bauforschung': 1L, ..."
<http://dbpedia.org/resource/G-Enka> ...,G-Enka,henry krvits born 30 december 1974 in tallinn ...,"{'legendary': 1L, 'gangstergenka': 1L, ..."
<http://dbpedia.org/resource/Sam_Henderson> ...,Sam Henderson,sam henderson born october 18 1969 is an ...,"{'now': 1L, 'currently': 1L, 'less': 1L, 'being': ..."
<http://dbpedia.org/resource/Aaron_LaCrate> ...,Aaron LaCrate,aaron lacrate is an american music producer ...,"{'exclusive': 2L, 'producer': 1L, 'tribe': ..."
<http://dbpedia.org/resource/Trevor_Ferguson> ...,Trevor Ferguson,trevor ferguson aka john farrow born 11 november ...,"{'taxi': 1L, 'salon': 1L, 'gangs': 1L, 'being': ..."
<http://dbpedia.org/resource/Grant_Nelson> ...,Grant Nelson,grant nelson born 27 april 1971 in london ...,"{'houston': 1L, 'frankie': 1L, 'labels': ..."
<http://dbpedia.org/resource/Cathy_Caruth> ...,Cathy Caruth,cathy caruth born 1955 is frank h t rhodes ...,"{'phenomenon': 1L, 'deborash': 1L, ..."


In [13]:
tfidf = graphlab.text_analytics.tf_idf(people['word_count'])

In [14]:
tfidf[0] # looking at the tfidf value for the first document

{'10': 2.3157231098806563,
 '1979': 2.6032908378122737,
 '19982000': 6.509158574746988,
 '2000': 1.8763068991994527,
 '2001': 1.9280249665871378,
 '2002': 1.8753125887822302,
 '2003': 1.8013702663900752,
 '2005': 1.6425861253275964,
 '2006': 1.520737905384506,
 '2007': 1.4879730697555795,
 '2008': 1.5093391374786154,
 '2009': 1.5644364836042695,
 '2011': 1.7023470901042919,
 '2013': 1.9545642372230505,
 '2014': 2.2073995783446634,
 '21': 2.797250863489293,
 '32': 4.3717697890214335,
 '44game': 9.887883100557085,
 'a': 0.022476737890332586,
 'acted': 4.137429106591736,
 'afl': 4.70049729471633,
 'aflfrom': 10.986495389225194,
 'against': 4.015921958283749,
 'age': 2.138848033513307,
 'along': 2.5088749729287803,
 'also': 0.4627270916162349,
 'and': 0.002980575592194913,
 'as': 0.2543390440248236,
 'assistant': 2.5220702633476124,
 'at': 0.8612771466165147,
 'australia': 2.86858644684204,
 'australian': 8.630007339620153,
 'before': 2.9935647453367427,
 'being': 1.7938099524877322,
 'blu

In [15]:
people['tfidf'] = tfidf

## Examine the TF-IDF for the Obama article

In [16]:
obama = people[people['name'] == 'Barack Obama']

In [17]:
obama[['tfidf']].stack('tfidf',new_column_name=['word','tfidf']).sort('tfidf',ascending=False)

word,tfidf
obama,43.2956530721
act,27.678222623
iraq,17.747378588
control,14.8870608452
law,14.7229357618
ordered,14.5333739509
military,13.1159327785
involvement,12.7843852412
response,12.7843852412
democratic,12.4106886973


Words with the highest TF-IDF scores are much more informative than the most frequent words.

# Manually compute distances between a few people

Let's manually compare the distances between the articles for a few famous people.  

In [18]:
clinton = people[people['name'] == 'Bill Clinton']

In [19]:
beckham = people[people['name'] == 'David Beckham']

## Is Obama closer to Clinton than to Beckham?

We will use cosine distance, which is given by

(1 - cosine_similarity)

so a smaller value means the two articles are more similar. We find that the article about President Obama is closer to the one about former President Clinton than to the one about footballer David Beckham.

In [20]:
graphlab.distances.cosine(obama['tfidf'][0],clinton['tfidf'][0])

0.8339854936884276

In [21]:
graphlab.distances.cosine(obama['tfidf'][0],beckham['tfidf'][0])

0.9791305844747478
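For reference, cosine distance over sparse {word: weight} dictionaries can be sketched as follows. This is a minimal standard-library version for illustration, not the code behind `graphlab.distances.cosine`. The three vectors are hypothetical, and the function assumes neither vector is all zeros.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity for two sparse {word: weight} dicts.

    Assumes both vectors have at least one non-zero weight.
    """
    # Only words present in both vectors contribute to the dot product.
    dot = sum(weight * b[word] for word, weight in a.items() if word in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return 1.0 - dot / (norm_a * norm_b)


# Hypothetical TF-IDF vectors, not taken from the dataset.
politician_1 = {"president": 2.0, "senate": 1.0, "law": 1.0}
politician_2 = {"president": 1.0, "senate": 2.0, "governor": 1.0}
footballer = {"football": 3.0, "club": 1.0, "law": 0.5}

d_pol = cosine_distance(politician_1, politician_2)
d_foot = cosine_distance(politician_1, footballer)
print(d_pol, d_foot)
```

The two politician vectors share heavily weighted words, so their distance is smaller than the distance between a politician and the footballer, which mirrors the Obama, Clinton, and Beckham comparison above.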

# Build a nearest neighbor model for document retrieval

We now create a nearest-neighbors model and apply it to document retrieval.  

In [22]:
knn_model = graphlab.nearest_neighbors.create(people,features=['tfidf'],label='name')

# Applying the nearest-neighbors model for retrieval

## Who is closest to Obama?

In [23]:
knn_model.query(obama)

query_label,reference_label,distance,rank
0,Barack Obama,0.0,1
0,Joe Biden,0.794117647059,2
0,Joe Lieberman,0.794685990338,3
0,Kelly Ayotte,0.811989100817,4
0,Bill Clinton,0.813852813853,5


As we can see, President Obama's article is closest to the one about his vice president, Joe Biden, and to those of other politicians.
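Conceptually, a nearest-neighbors query ranks every document by its distance to the query document. The sketch below does this by brute force with cosine distance, using only the standard library. It does not describe how `graphlab.nearest_neighbors.create` works internally or which distance it picks by default. The `query` helper and the toy corpus are hypothetical.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity for two non-zero sparse {word: weight} dicts."""
    dot = sum(weight * b[word] for word, weight in a.items() if word in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return 1.0 - dot / (norm_a * norm_b)


def query(corpus, label, k=3):
    """Return the k (label, distance) pairs closest to corpus[label], nearest first."""
    target = corpus[label]
    distances = [(other, cosine_distance(target, vec)) for other, vec in corpus.items()]
    return sorted(distances, key=lambda pair: pair[1])[:k]


# Hypothetical TF-IDF vectors keyed by document label.
corpus = {
    "politician_a": {"president": 2.0, "senate": 1.0},
    "politician_b": {"president": 1.0, "senate": 2.0},
    "musician_a": {"album": 2.0, "tour": 1.0},
    "musician_b": {"album": 1.0, "tour": 1.0, "senate": 0.5},
}

neighbors = query(corpus, "politician_a", k=3)
for name, distance in neighbors:
    print(name, round(distance, 3))
```

As in the query results above, the query document itself comes back first with a distance of (approximately) zero, followed by the documents that share its most heavily weighted words.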

## Other examples of document retrieval

In [24]:
swift = people[people['name'] == 'Taylor Swift']

In [25]:
knn_model.query(swift)

query_label,reference_label,distance,rank
0,Taylor Swift,0.0,1
0,Carrie Underwood,0.76231884058,2
0,Alicia Keys,0.764705882353,3
0,Jordin Sparks,0.769633507853,4
0,Leona Lewis,0.776119402985,5


In [26]:
jolie = people[people['name'] == 'Angelina Jolie']

In [27]:
knn_model.query(jolie)

query_label,reference_label,distance,rank
0,Angelina Jolie,0.0,1
0,Brad Pitt,0.784023668639,2
0,Julianne Moore,0.795857988166,3
0,Billy Bob Thornton,0.803069053708,4
0,George Clooney,0.8046875,5


In [28]:
arnold = people[people['name'] == 'Arnold Schwarzenegger']

In [29]:
knn_model.query(arnold)

query_label,reference_label,distance,rank
0,Arnold Schwarzenegger,0.0,1
0,Jesse Ventura,0.818918918919,2
0,John Kitzhaber,0.824615384615,3
0,Lincoln Chafee,0.833876221498,4
0,Anthony Foxx,0.833910034602,5


# Quiz

###### Question 1
What are the most frequent words (by raw word count) for Elton John?

In [30]:
elton = people[people['name'] == 'Elton John']

In [31]:
elton_word_count_table = elton[['word_count']].stack('word_count', new_column_name = ['word','count'])
elton_word_count_table.sort('count',ascending=False)[0:5]

word,count
the,27
in,18
and,15
of,13
a,10


###### Question 2
What are the top TF-IDF words for Elton John?

In [32]:
elton[['tfidf']].stack('tfidf',new_column_name=['word','tfidf']).sort('tfidf',ascending=False)[0:5]

word,tfidf
furnish,18.38947184
elton,17.48232027
billboard,17.3036809575
john,13.9393127924
songwriters,11.250406447


###### Question 3
The cosine distance between the 'Elton John' and 'Victoria Beckham' articles (represented with TF-IDF) falls within which range?

In [33]:
victoria = people[people['name'] == 'Victoria Beckham']

In [34]:
graphlab.distances.cosine(elton['tfidf'][0],victoria['tfidf'][0])

0.9567006376655429

###### Question 4
The cosine distance between the 'Elton John' and 'Paul McCartney' articles (represented with TF-IDF) falls within which range?

In [35]:
mccartney = people[people['name'] == 'Paul McCartney']

In [36]:
graphlab.distances.cosine(elton['tfidf'][0],mccartney['tfidf'][0])

0.8250310029221779

###### Question 5
Who is closer to 'Elton John', 'Victoria Beckham' or 'Paul McCartney'?

Given the answers to questions 3 and 4, the cosine distance from Elton John to **Paul McCartney** (about 0.825) is lower than the distance to Victoria Beckham (about 0.957), hence Paul McCartney is closer.

###### Question 6
Who is the nearest neighbor to 'Elton John' using raw word counts?

**raw word count model**

In [37]:
word_count_model = graphlab.nearest_neighbors.create(people,features=['word_count'],label='name')

In [38]:
word_count_model.query(elton)

query_label,reference_label,distance,rank
0,Elton John,0.0,1
0,Phil Collins,0.76399026764,2
0,Rod Stewart,0.773333333333,3
0,Annie Lennox,0.776623376623,4
0,Barry Gibb,0.780952380952,5


Since none of the neighbors above appear among the answer options, we manually compute the cosine distance to each option provided. The option with the smallest distance is the answer.

In [39]:
billy = people[people['name'] == 'Billy Joel']
graphlab.distances.cosine(elton['word_count'][0],billy['word_count'][0])

0.2222176478184027

In [40]:
cliff = people[people['name'] == 'Cliff Richard']
graphlab.distances.cosine(elton['word_count'][0],cliff['word_count'][0])

0.16142415258967036

In [41]:
roger = people[people['name'] == 'Roger Daltrey']
graphlab.distances.cosine(elton['word_count'][0],roger['word_count'][0])

0.17755418466559603

In [42]:
bush = people[people['name'] == 'George W. Bush']
graphlab.distances.cosine(elton['word_count'][0],bush['word_count'][0])

0.22029928745113225
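Picking the answer from a fixed set of options amounts to taking the minimum distance over those options. A minimal standard-library sketch on raw word counts follows. The helper `closest_option`, the option names, and the toy sentences are hypothetical.

```python
import math
from collections import Counter


def cosine_distance(a, b):
    """Return 1 - cosine similarity for two non-zero sparse {word: weight} dicts."""
    dot = sum(weight * b[word] for word, weight in a.items() if word in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return 1.0 - dot / (norm_a * norm_b)


def closest_option(query_vec, options):
    """Return (name, distance) for the option with the smallest cosine distance."""
    distances = {name: cosine_distance(query_vec, vec) for name, vec in options.items()}
    best = min(distances, key=distances.get)
    return best, distances[best]


# Hypothetical raw word counts, not taken from the dataset.
query_counts = Counter("the singer released the album and the single".split())
options = {
    "option_a": Counter("the singer recorded the album".split()),
    "option_b": Counter("the senator signed the bill".split()),
}

best_name, best_distance = closest_option(query_counts, options)
print(best_name, round(best_distance, 3))
```

Note that even the unrelated option ends up fairly close here, because the shared word "the" dominates the raw counts.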

###### Question 7
Who is the nearest neighbor to 'Elton John' using TF-IDF?

In [43]:
knn_model.query(elton)

query_label,reference_label,distance,rank
0,Elton John,0.0,1
0,Phil Collins,0.76399026764,2
0,Rod Stewart,0.773333333333,3
0,Annie Lennox,0.776623376623,4
0,Barry Gibb,0.780952380952,5


Phil Collins is the nearest neighbor based on TF-IDF, but he is not among the answer options. The answer is therefore **Rod Stewart**, the next closest article in the list.

###### Question 8
Who is the nearest neighbor to 'Victoria Beckham' using raw word counts?

In [44]:
word_count_model.query(victoria)

query_label,reference_label,distance,rank
0,Victoria Beckham,0.0,1
0,Cheryl Cole,0.800586510264,2
0,Heidi Klum,0.810344827586,3
0,Simon Fuller,0.822742474916,4
0,Adele,0.824915824916,5


In [45]:
dow = people[people['name'] == 'Stephen Dow Beckham']
graphlab.distances.cosine(victoria['word_count'][0],dow['word_count'][0])

0.31219127606005204

In [46]:
molloy = people[people['name'] == 'Louis Molloy']
graphlab.distances.cosine(victoria['word_count'][0],molloy['word_count'][0])

0.31050751295927514

In [47]:
corrl = people[people['name'] == 'Adrienne Corri']
graphlab.distances.cosine(victoria['word_count'][0],corrl['word_count'][0])

0.21450978278754795

In [48]:
fitg = people[people['name'] == 'Mary Fitzgerald (artist)']
graphlab.distances.cosine(victoria['word_count'][0],fitg['word_count'][0])

0.20730703611504997

###### Question 9
Who is the nearest neighbor to 'Victoria Beckham' using TF-IDF?

In [49]:
knn_model.query(victoria)

query_label,reference_label,distance,rank
0,Victoria Beckham,0.0,1
0,Cheryl Cole,0.800586510264,2
0,Heidi Klum,0.810344827586,3
0,Simon Fuller,0.822742474916,4
0,Adele,0.824915824916,5


In [50]:
mel = people[people['name'] == 'Mel B']
graphlab.distances.cosine(victoria['tfidf'][0],mel['tfidf'][0])

0.8095855234085036

In [51]:
caroline = people[people['name'] == 'Caroline Rush']
graphlab.distances.cosine(victoria['tfidf'][0],caroline['tfidf'][0])

In [52]:
graphlab.distances.cosine(victoria['tfidf'][0],beckham['tfidf'][0])

0.5481696102632148

In [53]:
carrie = people[people['name'] == 'Carrie Reichardt']
graphlab.distances.cosine(victoria['tfidf'][0],carrie['tfidf'][0])

0.9753950423443598