In [2]:
import graphlab # We will be using graphlab

# Loading text data from wikipedia - pages on people

In [4]:
people = graphlab.SFrame("people_wiki.gl/") # Load the data into an SFrame (GraphLab's data table)
# This dataset was provided by the Coursera team; I downloaded it beforehand.

In [8]:
people # I can see that this dataset has 59071 rows and only 3 columns

URI,name,text
<http://dbpedia.org/resource/Digby_Morrell> ...,Digby Morrell,digby morrell born 10 october 1979 is a former ...
<http://dbpedia.org/resource/Alfred_J._Lewy> ...,Alfred J. Lewy,alfred j lewy aka sandy lewy graduated from ...
<http://dbpedia.org/resource/Harpdog_Brown> ...,Harpdog Brown,harpdog brown is a singer and harmonica player who ...
<http://dbpedia.org/resource/Franz_Rottensteiner> ...,Franz Rottensteiner,franz rottensteiner born in waidmannsfeld lower ...
<http://dbpedia.org/resource/G-Enka> ...,G-Enka,henry krvits born 30 december 1974 in tallinn ...
<http://dbpedia.org/resource/Sam_Henderson> ...,Sam Henderson,sam henderson born october 18 1969 is an ...
<http://dbpedia.org/resource/Aaron_LaCrate> ...,Aaron LaCrate,aaron lacrate is an american music producer ...
<http://dbpedia.org/resource/Trevor_Ferguson> ...,Trevor Ferguson,trevor ferguson aka john farrow born 11 november ...
<http://dbpedia.org/resource/Grant_Nelson> ...,Grant Nelson,grant nelson born 27 april 1971 in london ...
<http://dbpedia.org/resource/Cathy_Caruth> ...,Cathy Caruth,cathy caruth born 1955 is frank h t rhodes ...


In [9]:
len(people) # or I can use this command if I only want to know the number of rows

59071

# Exploring the dataset and its text content

In [10]:
obama = people[people["name"] == "Barack Obama"]

In [18]:
obama

URI,name,text
<http://dbpedia.org/resource/Barack_Obama> ...,Barack Obama,barack hussein obama ii brk husen bm born august ...


I want to see what the whole text looks like.

In [19]:
obama["text"]

dtype: str
Rows: ?
['barack hussein obama ii brk husen bm born august 4 1961 is the 44th and current president of the united states and the first african american to hold the office born in honolulu hawaii obama is a graduate of columbia university and harvard law school where he served as president of the harvard law review he was a community organizer in chicago before earning his law degree he worked as a civil rights attorney and taught constitutional law at the university of chicago law school from 1992 to 2004 he served three terms representing the 13th district in the illinois senate from 1997 to 2004 running unsuccessfully for the united states house of representatives in 2000in 2004 obama received national attention during his campaign to represent illinois in the united states senate with his victory in the march democratic party primary his keynote address at the democratic national convention in july and his election to the senate in november he began his presidential campa

In [21]:
clooney = people[people["name"] == "George Clooney"]
clooney["text"]

dtype: str
Rows: ?
['george timothy clooney born may 6 1961 is an american actor writer producer director and activist he has received three golden globe awards for his work as an actor and two academy awards one for acting and the other for producingclooney made his acting debut on television in 1978 and later gained wide recognition in his role as dr doug ross on the longrunning medical drama er from 1994 to 1999 for which he received two emmy award nominations while working on er he began attracting a variety of leading roles in films including the superhero film batman robin 1997 and the crime comedy out of sight 1998 in which he first worked with a director who would become a longtime collaborator steven soderbergh in 1999 clooney took the lead role in three kings a wellreceived war satire set during the gulf warin 2001 clooneys fame widened with the release of his biggest commercial success the heist comedy oceans eleven the first of the film trilogy a remake of the 1960 film wit

# Getting word counts for Obama article

I want to add a new column to the table that holds a word-count dictionary for the article.

In [23]:
obama["word_count"] = graphlab.text_analytics.count_words(obama["text"]) # This is a method of graphlab to count words

In [28]:
obama["word_count"] # The result is a dictionary with words as keys and their frequencies as values

dtype: dict
Rows: 1
[{'operations': 1L, 'represent': 1L, 'office': 2L, 'unemployment': 1L, 'doddfrank': 1L, 'over': 1L, 'unconstitutional': 1L, 'domestic': 2L, 'major': 1L, 'years': 1L, 'against': 1L, 'proposition': 1L, 'seats': 1L, 'graduate': 1L, 'debate': 1L, 'before': 1L, 'death': 1L, '20': 2L, 'taxpayer': 1L, 'representing': 1L, 'obamacare': 1L, 'barack': 1L, 'to': 14L, '4': 1L, 'policy': 2L, '8': 1L, 'he': 7L, '2011': 3L, '2010': 2L, '2013': 1L, '2012': 1L, 'bin': 1L, 'then': 1L, 'his': 11L, 'march': 1L, 'gains': 1L, 'cuba': 1L, 'school': 3L, '1992': 1L, 'new': 1L, 'not': 1L, 'during': 2L, 'ending': 1L, 'continued': 1L, 'presidential': 2L, 'states': 3L, 'husen': 1L, 'osama': 1L, 'californias': 1L, 'equality': 1L, 'prize': 1L, 'lost': 1L, 'made': 1L, 'inaugurated': 1L, 'january': 3L, 'university': 2L, 'rights': 1L, 'july': 1L, 'gun': 1L, 'stimulus': 1L, 'rodham': 1L, 'troop': 1L, 'withdrawal': 1L, 'brk': 1L, 'nine': 1L, 'where': 1L, 'referred': 1L, 'affordable': 1L, 'attorney': 1L
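
To make the idea concrete, here is a rough pure-Python sketch of what a word count does. It is my own stand-in, not GraphLab's implementation: the helper name `count_words_sketch` is made up, and I assume simple whitespace tokenization, which may differ from how `count_words` splits text.

```python
from collections import Counter

def count_words_sketch(text):
    """Hypothetical helper: map each whitespace-separated token to its count."""
    return dict(Counter(text.lower().split()))

# Opening words of the Obama article, copied from the output above.
sample = ("barack hussein obama ii brk husen bm born august 4 1961 is the "
          "44th and current president of the united states")

word_count = count_words_sketch(sample)
print(word_count["the"])    # "the" occurs twice in this short snippet
print(word_count["obama"])  # "obama" occurs once
```

The result has the same shape as the `word_count` column: one dictionary per document, with words as keys and counts as values.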

## Sorting the word counts for Obama article

In [32]:
obama_word_count_table = obama[["word_count"]].stack("word_count", new_column_name = ["word", "count"])
# .stack() unpacks a dictionary column: each key/value pair becomes its own row,
# with the key and the value in two separate columns

In [34]:
obama_word_count_table.sort("count", ascending = False) 
# I sort the table from the most frequent words to the least

word,count
the,40
in,30
and,21
of,18
to,14
his,11
obama,9
act,8
a,7
he,7


Now we can clearly see the problem with raw word frequencies. Of the top 10 words, eight are common function words (the, in, and, of, to, his, a, he) that say nothing about the article; only "obama" and, to a lesser extent, "act" are informative. We therefore need to apply the TF-IDF model to get more relevant results.

## Computing TF-IDF for the entire corpus

tf–idf is the product of two statistics, term frequency and inverse document frequency.

In the case of the term frequency tf(t,d), the simplest choice is to use the raw frequency of a term in a document, i.e. the number of times that term t occurs in document d. If we denote the raw frequency of t by f(t,d), then the simple tf scheme is tf(t,d) = f(t,d).

The inverse document frequency is a measure of how much information the word provides, that is, whether the term is common or rare across all documents. It is the logarithmically scaled inverse fraction of the documents that contain the word, obtained by dividing the total number of documents by the number of documents containing the term, and then taking the logarithm of that quotient.

Source: https://en.wikipedia.org/wiki/Tf%E2%80%93idf
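
The definition above translates almost directly into code. The following is a minimal sketch under my own assumptions: the function name `tf_idf_sketch` and the toy corpus are made up, and I use the plain formula count × log(N / document frequency) with no smoothing, so the numbers need not match what GraphLab's `tf_idf` produces.

```python
import math

def tf_idf_sketch(docs_word_counts):
    """Hypothetical helper: tf-idf for a list of {word: count} dictionaries.

    tf(t, d) = raw count of t in d
    idf(t)   = log(N / number of documents that contain t)
    """
    n_docs = len(docs_word_counts)

    # Document frequency: in how many documents does each word occur?
    doc_freq = {}
    for counts in docs_word_counts:
        for word in counts:
            doc_freq[word] = doc_freq.get(word, 0) + 1

    # Multiply each raw count by the idf of its word.
    return [
        {word: count * math.log(float(n_docs) / doc_freq[word])
         for word, count in counts.items()}
        for counts in docs_word_counts
    ]

# Toy corpus, made up for illustration (not taken from the Wikipedia data).
corpus = [
    {"the": 3, "president": 2, "senate": 1},
    {"the": 4, "singer": 2, "album": 1},
    {"the": 2, "singer": 1, "football": 3},
]

weights = tf_idf_sketch(corpus)
print(weights[0])
```

In this toy corpus "the" occurs in every document, so its idf is log(3/3) = 0 and its weight vanishes no matter how often it appears, while "president" occurs in only one document and keeps a positive weight.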

In [37]:
# First I need to count the words for the whole corpus, the same way I did for obama
people["word_count"] = graphlab.text_analytics.count_words(people["text"]) # This took almost no time to compute for all 59,071 rows

In [36]:
people.head()

URI,name,text,word_count
<http://dbpedia.org/resource/Digby_Morrell> ...,Digby Morrell,digby morrell born 10 october 1979 is a former ...,"{'since': 1L, 'carltons': 1L, 'being': 1L, '2005': ..."
<http://dbpedia.org/resource/Alfred_J._Lewy> ...,Alfred J. Lewy,alfred j lewy aka sandy lewy graduated from ...,"{'precise': 1L, 'thomas': 1L, 'closely': 1L, ..."
<http://dbpedia.org/resource/Harpdog_Brown> ...,Harpdog Brown,harpdog brown is a singer and harmonica player who ...,"{'just': 1L, 'issued': 1L, 'mainly': 1L, ..."
<http://dbpedia.org/resource/Franz_Rottensteiner> ...,Franz Rottensteiner,franz rottensteiner born in waidmannsfeld lower ...,"{'all': 1L, 'bauforschung': 1L, ..."
<http://dbpedia.org/resource/G-Enka> ...,G-Enka,henry krvits born 30 december 1974 in tallinn ...,"{'legendary': 1L, 'gangstergenka': 1L, ..."
<http://dbpedia.org/resource/Sam_Henderson> ...,Sam Henderson,sam henderson born october 18 1969 is an ...,"{'now': 1L, 'currently': 1L, 'less': 1L, 'being': ..."
<http://dbpedia.org/resource/Aaron_LaCrate> ...,Aaron LaCrate,aaron lacrate is an american music producer ...,"{'exclusive': 2L, 'producer': 1L, 'tribe': ..."
<http://dbpedia.org/resource/Trevor_Ferguson> ...,Trevor Ferguson,trevor ferguson aka john farrow born 11 november ...,"{'taxi': 1L, 'salon': 1L, 'gangs': 1L, 'being': ..."
<http://dbpedia.org/resource/Grant_Nelson> ...,Grant Nelson,grant nelson born 27 april 1971 in london ...,"{'houston': 1L, 'frankie': 1L, 'labels': ..."
<http://dbpedia.org/resource/Cathy_Caruth> ...,Cathy Caruth,cathy caruth born 1955 is frank h t rhodes ...,"{'phenomenon': 1L, 'deborash': 1L, ..."


In [38]:
people_tfidf = graphlab.text_analytics.tf_idf(people["word_count"]) 
# tf_idf is a built-in function of graphlab.text_analytics that computes exactly what we need
people_tfidf

docs
"{'since': 1.455376717308041, ..."
"{'precise': 6.44320060695519, ..."
"{'just': 2.7007299687108643, ..."
"{'all': 1.6431112434912472, ..."
"{'legendary': 4.280856294365192, ..."
"{'now': 1.96695239252401, 'currently': ..."
"{'exclusive': 10.455187230695827, ..."
"{'taxi': 6.0520214560945025, ..."
"{'houston': 3.935505942157149, ..."
"{'phenomenon': 5.750053426395245, ..."


In [39]:
people["tfidf"] = people_tfidf["docs"] # This simply attaches the new column to the original SFrame

In [40]:
people.head() # there are now 5 columns in the dataframe

URI,name,text,word_count,tfidf
<http://dbpedia.org/resource/Digby_Morrell> ...,Digby Morrell,digby morrell born 10 october 1979 is a former ...,"{'since': 1L, 'carltons': 1L, 'being': 1L, '2005': ...","{'since': 1.455376717308041, ..."
<http://dbpedia.org/resource/Alfred_J._Lewy> ...,Alfred J. Lewy,alfred j lewy aka sandy lewy graduated from ...,"{'precise': 1L, 'thomas': 1L, 'closely': 1L, ...","{'precise': 6.44320060695519, ..."
<http://dbpedia.org/resource/Harpdog_Brown> ...,Harpdog Brown,harpdog brown is a singer and harmonica player who ...,"{'just': 1L, 'issued': 1L, 'mainly': 1L, ...","{'just': 2.7007299687108643, ..."
<http://dbpedia.org/resource/Franz_Rottensteiner> ...,Franz Rottensteiner,franz rottensteiner born in waidmannsfeld lower ...,"{'all': 1L, 'bauforschung': 1L, ...","{'all': 1.6431112434912472, ..."
<http://dbpedia.org/resource/G-Enka> ...,G-Enka,henry krvits born 30 december 1974 in tallinn ...,"{'legendary': 1L, 'gangstergenka': 1L, ...","{'legendary': 4.280856294365192, ..."
<http://dbpedia.org/resource/Sam_Henderson> ...,Sam Henderson,sam henderson born october 18 1969 is an ...,"{'now': 1L, 'currently': 1L, 'less': 1L, 'being': ...","{'now': 1.96695239252401, 'currently': ..."
<http://dbpedia.org/resource/Aaron_LaCrate> ...,Aaron LaCrate,aaron lacrate is an american music producer ...,"{'exclusive': 2L, 'producer': 1L, 'tribe': ...","{'exclusive': 10.455187230695827, ..."
<http://dbpedia.org/resource/Trevor_Ferguson> ...,Trevor Ferguson,trevor ferguson aka john farrow born 11 november ...,"{'taxi': 1L, 'salon': 1L, 'gangs': 1L, 'being': ...","{'taxi': 6.0520214560945025, ..."
<http://dbpedia.org/resource/Grant_Nelson> ...,Grant Nelson,grant nelson born 27 april 1971 in london ...,"{'houston': 1L, 'frankie': 1L, 'labels': ...","{'houston': 3.935505942157149, ..."
<http://dbpedia.org/resource/Cathy_Caruth> ...,Cathy Caruth,cathy caruth born 1955 is frank h t rhodes ...,"{'phenomenon': 1L, 'deborash': 1L, ...","{'phenomenon': 5.750053426395245, ..."


## Examining the tf-idf for the Obama article

I want to create a new obama table, so I can stack it and sort it exactly the same way as before. It has to be re-created because the old one was built before the tfidf column existed.

In [41]:
obama = people[people["name"] == "Barack Obama"]

In [43]:
# What I did in several lines before, I can do in a one-liner
obama[["tfidf"]].stack("tfidf", new_column_name = ["word", "tfidf"]).sort("tfidf", ascending = False)

word,tfidf
obama,43.2956530721
act,27.678222623
iraq,17.747378588
control,14.8870608452
law,14.7229357618
ordered,14.5333739509
military,13.1159327785
involvement,12.7843852412
response,12.7843852412
democratic,12.4106886973


The TF-IDF ranking confirms our first intuition: the most frequent words in the earlier table were mostly uninformative. Only "obama" and "act" survive from that top 10; the other eight words are new and far more specific to the article.

# Manually computing distances between a few people

We want to get a feel for the distances between the articles of different people.

In [44]:
clinton = people[people["name"] == "Bill Clinton"]

In [45]:
beckham = people[people["name"] == "David Beckham"]

Is Obama closer to Clinton than to Beckham?

In [49]:
graphlab.distances.cosine(obama["tfidf"][0], clinton["tfidf"][0])
# The lower the distance, the more similar the compared elements are

0.8339854936884276

In [50]:
graphlab.distances.cosine(obama["tfidf"][0], beckham["tfidf"][0])

0.9791305844747478

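For vectors with non-negative entries, such as TF-IDF weights, cosine distance ranges from 0 (same direction) to 1 (no words in common). At 0.98, Obama and Beckham are almost as far apart as possible, while Clinton (0.83) is noticeably closer.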
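
To show where these numbers come from, here is a small sketch of cosine distance on sparse word dictionaries. It is my own illustration, not GraphLab's code: the function name `cosine_distance_sketch` and the toy vectors are made up, and I assume both dictionaries are non-empty so the norms are never zero.

```python
import math

def cosine_distance_sketch(a, b):
    """Hypothetical helper: 1 - cosine similarity of two {word: weight} dicts."""
    # Only words present in both dictionaries contribute to the dot product.
    dot = sum(weight * b[word] for word, weight in a.items() if word in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return 1.0 - dot / (norm_a * norm_b)

# Toy vectors, made up for illustration.
politician_1 = {"senate": 2.0, "law": 1.0}
politician_2 = {"senate": 1.0, "law": 1.0, "governor": 1.0}
footballer = {"goal": 3.0, "club": 2.0}

print(cosine_distance_sketch(politician_1, politician_2))  # shared words, so well below 1
print(cosine_distance_sketch(politician_1, footballer))    # no shared words, so exactly 1.0
```

Two articles with no words in common have a dot product of zero, which is why the distance tops out at 1 for non-negative weights.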

# Building a nearest neighbor model for document retrieval

In [52]:
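knn_model = graphlab.nearest_neighbors.create(people, features = ["tfidf"], label = "name")
# No distance is specified here, so GraphLab chooses one automatically. The distances this
# model reports therefore differ from the cosine distances computed by hand above.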

PROGRESS: Starting brute force nearest neighbors model training.
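
"Brute force" here means that a query is compared against every reference row and the closest ones are kept. A minimal sketch of that idea follows. Everything in it is an assumption for illustration: the names `brute_force_query` and `cosine_distance_sketch`, the toy data, and the choice of cosine distance (the model above lets GraphLab pick its own distance).

```python
import math

def cosine_distance_sketch(a, b):
    """Hypothetical helper: 1 - cosine similarity of two {word: weight} dicts."""
    dot = sum(weight * b[word] for word, weight in a.items() if word in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return 1.0 - dot / (norm_a * norm_b)

def brute_force_query(reference, query_vector, k=5):
    """Hypothetical helper: the k (label, distance) pairs closest to query_vector."""
    # Compare the query against every reference row, then keep the k smallest distances.
    scored = [(label, cosine_distance_sketch(query_vector, vector))
              for label, vector in reference.items()]
    scored.sort(key=lambda pair: pair[1])
    return scored[:k]

# Toy reference set, made up for illustration.
reference = {
    "Politician A": {"senate": 2.0, "law": 1.0},
    "Politician B": {"senate": 1.0, "law": 1.0, "governor": 1.0},
    "Singer C": {"album": 2.0, "tour": 1.0},
}

result = brute_force_query(reference, reference["Politician A"], k=3)
for label, distance in result:
    print(label, distance)
```

As in the real queries, the query row itself comes back first with a distance of (nearly) zero, because it is also part of the reference set.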


# Applying the nearest-neighbor model for retrieval

## Who is closest to Obama?

In [53]:
knn_model.query(obama)

PROGRESS: Starting pairwise querying.
PROGRESS: +--------------+---------+-------------+--------------+
PROGRESS: | Query points | # Pairs | % Complete. | Elapsed Time |
PROGRESS: +--------------+---------+-------------+--------------+
PROGRESS: | 0            | 1       | 0.00169288  | 99.11ms      |
PROGRESS: | Done         |         | 100         | 548.224ms    |
PROGRESS: +--------------+---------+-------------+--------------+


query_label,reference_label,distance,rank
0,Barack Obama,0.0,1
0,Joe Biden,0.794117647059,2
0,Joe Lieberman,0.794685990338,3
0,Kelly Ayotte,0.811989100817,4
0,Bill Clinton,0.813852813853,5


## Other examples of document retrieval

In [54]:
swift = people[people["name"] == "Taylor Swift"]

In [55]:
knn_model.query(swift)

PROGRESS: Starting pairwise querying.
PROGRESS: +--------------+---------+-------------+--------------+
PROGRESS: | Query points | # Pairs | % Complete. | Elapsed Time |
PROGRESS: +--------------+---------+-------------+--------------+
PROGRESS: | 0            | 1       | 0.00169288  | 14.011ms     |
PROGRESS: | Done         |         | 100         | 395.781ms    |
PROGRESS: +--------------+---------+-------------+--------------+


query_label,reference_label,distance,rank
0,Taylor Swift,0.0,1
0,Carrie Underwood,0.76231884058,2
0,Alicia Keys,0.764705882353,3
0,Jordin Sparks,0.769633507853,4
0,Leona Lewis,0.776119402985,5


In [56]:
jolie = people[people["name"] == "Angelina Jolie"]

In [57]:
knn_model.query(jolie)

PROGRESS: Starting pairwise querying.
PROGRESS: +--------------+---------+-------------+--------------+
PROGRESS: | Query points | # Pairs | % Complete. | Elapsed Time |
PROGRESS: +--------------+---------+-------------+--------------+
PROGRESS: | 0            | 1       | 0.00169288  | 19.021ms     |
PROGRESS: | Done         |         | 100         | 317.225ms    |
PROGRESS: +--------------+---------+-------------+--------------+


query_label,reference_label,distance,rank
0,Angelina Jolie,0.0,1
0,Brad Pitt,0.784023668639,2
0,Julianne Moore,0.795857988166,3
0,Billy Bob Thornton,0.803069053708,4
0,George Clooney,0.8046875,5


## Further exploring additional people - Elton John

In [58]:
elton = people[people["name"] == "Elton John"]

In [65]:
elton_word_count_table = elton[["word_count"]].stack("word_count", new_column_name = ["word", "count"]).sort("count", ascending = False)
elton_word_count_table

word,count
the,27
in,18
and,15
of,13
a,10
has,9
he,7
john,7
on,6
since,5


In [67]:
elton_tfidf_table = elton[["tfidf"]].stack("tfidf", new_column_name =["word", "tfidf"]).sort("tfidf", ascending = False)
elton_tfidf_table

word,tfidf
furnish,18.38947184
elton,17.48232027
billboard,17.3036809575
john,13.9393127924
songwriters,11.250406447
overallelton,10.9864953892
tonightcandle,10.9864953892
19702000,10.2933482087
fivedecade,10.2933482087
aids,10.262846934


In [69]:
elton_word_count_table["tfidf_word"] = elton_tfidf_table["word"]
elton_word_count_table["tfidf"] = elton_tfidf_table["tfidf"]
elton_word_count_table

word,count,tfidf_word,tfidf
the,27,furnish,18.38947184
in,18,elton,17.48232027
and,15,billboard,17.3036809575
of,13,john,13.9393127924
a,10,songwriters,11.250406447
has,9,overallelton,10.9864953892
he,7,tonightcandle,10.9864953892
john,7,19702000,10.2933482087
on,6,fivedecade,10.2933482087
since,5,aids,10.262846934


Comparing the columns in this last table, it is clear that the TF-IDF model weights uninformative words down, while less frequent but more distinctive words are weighted up.
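
The same side-by-side comparison can be reproduced in plain Python from the two top-10 lists above. This is only a sketch: the variable names are mine, and the rows are paired purely by position, which works here because both lists are already sorted and have the same length.

```python
# Top-10 rows copied from the two Elton John tables above.
by_count = [("the", 27), ("in", 18), ("and", 15), ("of", 13), ("a", 10),
            ("has", 9), ("he", 7), ("john", 7), ("on", 6), ("since", 5)]
by_tfidf = [("furnish", 18.38947184), ("elton", 17.48232027),
            ("billboard", 17.3036809575), ("john", 13.9393127924),
            ("songwriters", 11.250406447), ("overallelton", 10.9864953892),
            ("tonightcandle", 10.9864953892), ("19702000", 10.2933482087),
            ("fivedecade", 10.2933482087), ("aids", 10.262846934)]

# Pair the rows by position to get one combined row per rank.
side_by_side = [(word, count, tfidf_word, tfidf)
                for (word, count), (tfidf_word, tfidf) in zip(by_count, by_tfidf)]
for row in side_by_side:
    print(row)

# Which words make it into both top-10 lists?
frequent = {word for word, _ in by_count}
shared = [word for word, _ in by_tfidf if word in frequent]
print(shared)  # ['john']
```

Only "john" appears in both rankings; every other frequent word drops out of the TF-IDF top 10.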

In [70]:
knn_model.query(elton)

PROGRESS: Starting pairwise querying.
PROGRESS: +--------------+---------+-------------+--------------+
PROGRESS: | Query points | # Pairs | % Complete. | Elapsed Time |
PROGRESS: +--------------+---------+-------------+--------------+
PROGRESS: | 0            | 1       | 0.00169288  | 15.629ms     |
PROGRESS: | Done         |         | 100         | 343.757ms    |
PROGRESS: +--------------+---------+-------------+--------------+


query_label,reference_label,distance,rank
0,Elton John,0.0,1
0,Phil Collins,0.76399026764,2
0,Rod Stewart,0.773333333333,3
0,Annie Lennox,0.776623376623,4
0,Barry Gibb,0.780952380952,5


In [71]:
# I will compare Elton John's distance to these two people
victoria = people[people["name"] == "Victoria Beckham"]
mccartney = people[people["name"] == "Paul McCartney"]

In [72]:
graphlab.distances.cosine(elton["tfidf"][0], victoria["tfidf"][0])

0.9567006376655429

In [73]:
graphlab.distances.cosine(elton["tfidf"][0], mccartney["tfidf"][0])

0.8250310029221779

In [76]:
words_model = graphlab.nearest_neighbors.create(people, features = ["word_count"], distance="cosine", label = "name")

PROGRESS: Starting brute force nearest neighbors model training.


In [77]:
words_model.query(elton)

PROGRESS: Starting pairwise querying.
PROGRESS: +--------------+---------+-------------+--------------+
PROGRESS: | Query points | # Pairs | % Complete. | Elapsed Time |
PROGRESS: +--------------+---------+-------------+--------------+
PROGRESS: | 0            | 1       | 0.00169288  | 11.988ms     |
PROGRESS: | Done         |         | 100         | 309.2ms      |
PROGRESS: +--------------+---------+-------------+--------------+


query_label,reference_label,distance,rank
0,Elton John,2.22044604925e-16,1
0,Cliff Richard,0.16142415259,2
0,Sandro Petrone,0.16822542751,3
0,Rod Stewart,0.168327165587,4
0,Malachi O'Doherty,0.177315545979,5


In [78]:
tfidf_model = graphlab.nearest_neighbors.create(people, features = ["tfidf"], distance="cosine", label = "name")

PROGRESS: Starting brute force nearest neighbors model training.


In [79]:
tfidf_model.query(elton)

PROGRESS: Starting pairwise querying.
PROGRESS: +--------------+---------+-------------+--------------+
PROGRESS: | Query points | # Pairs | % Complete. | Elapsed Time |
PROGRESS: +--------------+---------+-------------+--------------+
PROGRESS: | 0            | 1       | 0.00169288  | 12.008ms     |
PROGRESS: | Done         |         | 100         | 402.284ms    |
PROGRESS: +--------------+---------+-------------+--------------+


query_label,reference_label,distance,rank
0,Elton John,-2.22044604925e-16,1
0,Rod Stewart,0.717219667893,2
0,George Michael,0.747600998969,3
0,Sting (musician),0.747671954431,4
0,Phil Collins,0.75119324879,5


In [80]:
words_model.query(victoria)

PROGRESS: Starting pairwise querying.
PROGRESS: +--------------+---------+-------------+--------------+
PROGRESS: | Query points | # Pairs | % Complete. | Elapsed Time |
PROGRESS: +--------------+---------+-------------+--------------+
PROGRESS: | 0            | 1       | 0.00169288  | 15.63ms      |
PROGRESS: | Done         |         | 100         | 283.878ms    |
PROGRESS: +--------------+---------+-------------+--------------+


query_label,reference_label,distance,rank
0,Victoria Beckham,-2.22044604925e-16,1
0,Mary Fitzgerald (artist),0.207307036115,2
0,Adrienne Corri,0.214509782788,3
0,Beverly Jane Fry,0.217466468741,4
0,Raman Mundair,0.217695474992,5


In [81]:
tfidf_model.query(victoria)

PROGRESS: Starting pairwise querying.
PROGRESS: +--------------+---------+-------------+--------------+
PROGRESS: | Query points | # Pairs | % Complete. | Elapsed Time |
PROGRESS: +--------------+---------+-------------+--------------+
PROGRESS: | 0            | 1       | 0.00169288  | 14.011ms     |
PROGRESS: | Done         |         | 100         | 342.681ms    |
PROGRESS: +--------------+---------+-------------+--------------+


query_label,reference_label,distance,rank
0,Victoria Beckham,1.11022302463e-16,1
0,David Beckham,0.548169610263,2
0,Stephen Dow Beckham,0.784986706828,3
0,Mel B,0.809585523409,4
0,Caroline Rush,0.819826422919,5


# Conclusion

This was an introduction to the bag-of-words technique: counting words, calculating TF-IDF, and comparing documents with each other using distances and nearest-neighbor search.