### Document Retrieval
#### Challenges:
* How to measure similarity among articles?
* How to search over articles?

### Representation of the document: Word count document representation.

#### Most popular model - "Bag of words" model.

* Ignore order of words
* Count # of instances of each word in vocabulary.

* Each document is represented by a word count vector: one entry per vocabulary word, holding the number of times that word occurs in the document.
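
As a rough illustration of the bag-of-words idea, the sketch below counts words with the standard library. The function name `word_count_vector` and the whitespace tokenization are assumptions made for this example; they are not taken from the course material.

```python
from collections import Counter


def word_count_vector(text):
    """Return a bag-of-words representation: word -> number of occurrences."""
    # Lowercase and split on whitespace; word order is discarded.
    return Counter(text.lower().split())


counts = word_count_vector("Messi scored and Messi celebrated")
print(counts["messi"])  # 2
print(counts["goal"])   # 0 (words that never occur count as zero)
```

A `Counter` stores only the words that actually occur, which is a sparse version of the full vocabulary-length vector described above.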

### Measuring similarity:
* Take the word count vectors of the two documents being compared.
* Multiply the vectors element-wise and sum the products over the length of the vector (the dot product); the result is the measure of similarity.

#### Example 1: Two articles on soccer and their respective word count vectors
* 1 0 0 0 5 3 0 0 1 0 0 0 0
* 3 0 0 0 2 0 0 1 0 1 0 0 0
* 1 * 3 + 5 * 2 = 13 (measure of similarity)

#### Example 2: One article on soccer and another on conflicts in Africa
* 1 0 0 0 5 3 0 0 1 0 0 0 0
* 0 0 1 0 0 0 9 0 0 6 0 4 0
* 0 (measure of similarity), because the two articles share no words
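
The two examples can be checked with a few lines of plain Python. This is only a sketch; the function name `similarity` and the variable names are chosen for illustration.

```python
def similarity(u, v):
    """Sum of element-wise products (dot product) of two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return sum(a * b for a, b in zip(u, v))


soccer_1 = [1, 0, 0, 0, 5, 3, 0, 0, 1, 0, 0, 0, 0]
soccer_2 = [3, 0, 0, 0, 2, 0, 0, 1, 0, 1, 0, 0, 0]
africa = [0, 0, 1, 0, 0, 0, 9, 0, 0, 6, 0, 4, 0]

print(similarity(soccer_1, soccer_2))  # 13
print(similarity(soccer_1, africa))    # 0
```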

### 1. Issues with raw word counts - Document Length
* Consider Example 1, where the similarity measure was 13.
* Now double the length of both documents, so that every word in the original documents appears twice. The word count vectors become:

* 2 0 0 0 10 6 0 0 2 0 0 0 0
* 6 0 0 0 4  0 0 2 0 2 0 0 0
* Measure of similarity -> 2 * 6 + 10 * 4 = 52

* The documents carry the same information as before (it is just replicated), yet the similarity has quadrupled. This is undesirable: raw word counts bias the measure towards long documents.

### Solutions for Document length - with raw word counts
#### Normalize the word count vector
* Compute the norm of the word count vector and divide every entry by it.
* Norm -> square root of the sum of the squares of all entries in the vector.
* 1 0 0 0 5 3 0 0 1 0 0 0 0
* Norm = sqrt(1 * 1 + 5 * 5 + 3 * 3 + 1 * 1) = sqrt(1 + 25 + 9 + 1) = sqrt(36) = 6

* 1/6 0 0 0 5/6 3/6 0 0 1/6 0 0 0 0

##### This places all articles on an equal footing regardless of their length; the normalized vectors are then used for retrieval.
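
The following sketch illustrates both the length bias and the effect of normalization on the Example 1 vectors. The helper names `dot` and `normalize` are hypothetical, and leaving an all-zero vector unchanged is an assumption made here to avoid dividing by zero.

```python
import math


def dot(u, v):
    """Sum of element-wise products of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Divide each entry by the vector's Euclidean (L2) norm."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0:
        return list(v)  # assumption: an all-zero vector is left unchanged
    return [x / norm for x in v]


doc_a = [1, 0, 0, 0, 5, 3, 0, 0, 1, 0, 0, 0, 0]
doc_b = [3, 0, 0, 0, 2, 0, 0, 1, 0, 1, 0, 0, 0]
doubled_a = [2 * x for x in doc_a]
doubled_b = [2 * x for x in doc_b]

# Raw counts: doubling both documents quadruples the similarity.
print(dot(doc_a, doc_b), dot(doubled_a, doubled_b))  # 13 52

# Normalized vectors: the similarity no longer depends on document length.
sim = dot(normalize(doc_a), normalize(doc_b))
sim_doubled = dot(normalize(doubled_a), normalize(doubled_b))
print(abs(sim - sim_doubled) < 1e-12)  # True
```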

### 2. Issue with word counts - Emphasis on words
#### Rare words
* Across a corpus, common words dominate the similarity metric; in contrast, rare words get ignored or swamped.
* Rare words are important, since these terms help to uniquely identify the document.
* The goal is therefore to increase the importance of rare words.

#### Document frequency of rare words:
* A rare word appears infrequently in the corpus.
* Discount the weight of a word w based on the number of documents in the corpus that contain w.

#### Important words
* Words relevant to the article must be emphasized, not common words like "the", "it", etc.
* An important word appears frequently in the document (common locally).
* It appears rarely in the corpus (rare globally).
* Importance is derived from the trade-off between local frequency and global rarity.

### TF-IDF Document representation
* TF-IDF (term frequency - inverse document frequency) captures the trade-off between local frequency and global rarity.

#### Term Frequency -> look locally -> count the occurrences of each word within the document -> same as the word count vector
#### Inverse Document Frequency -> downweight the vector.
* Look through all the documents in the corpus.
* IDF = log(# docs / (1 + # docs using the word)). The 1 in the denominator prevents division by zero when no document in the corpus contains the word.

* For frequently occurring words -> log(largenum / (1 + largenum)) ~ log 1 = 0
* For rarely occurring words -> log(largenum / (1 + smallnum)) -> a large number

#### Example

* Term frequency
 - The term "the" occurs 1000 times in the document, and the term "messi" occurs 5 times in the document.
 
           ---------------------------------------------
           |   | 1000 |  |  |  |  |  5   |  |  |  |  |  |
           ---------------------------------------------
                 the                messi
 * Inverse document frequency
  - Consider a corpus of 64 documents, in which the term "the" occurs in 63 of them and "messi" occurs in 3 of them.
  - IDF for "the" = log2(64 / (1 + 63)) = log2(1) = 0
  - IDF for "messi" = log2(64 / (1 + 3)) = log2(16) = 4 (the example uses a base-2 logarithm)
  
             ---------------------------------------------
             |   |  0  |  |  |  |  |   4   |  |  |  |  |  |
             ---------------------------------------------
                   the                messi
 
 * TF-IDF = term frequency * IDF, computed term by term
 
            ---------------------------------------------
           |   |  0  |  |  |  |  |   20  |  |  |  |  |  |
            ---------------------------------------------
                 the                messi
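
The worked example can be reproduced with a short Python 3 sketch. The function names `idf` and `tf_idf` are made up for this illustration, and the base-2 logarithm is an assumption inferred from log(16) = 4 in the example; library implementations may use a different base or smoothing.

```python
import math


def idf(num_docs, docs_with_word):
    """Inverse document frequency with +1 in the denominator, log base 2."""
    return math.log2(num_docs / (1 + docs_with_word))


def tf_idf(term_counts, doc_freq, num_docs):
    """Weight each term count by the IDF of that term."""
    return {
        word: tf * idf(num_docs, doc_freq.get(word, 0))
        for word, tf in term_counts.items()
    }


term_counts = {"the": 1000, "messi": 5}  # counts within one document
doc_freq = {"the": 63, "messi": 3}       # number of documents containing each word
weights = tf_idf(term_counts, doc_freq, num_docs=64)
print(weights)  # {'the': 0.0, 'messi': 20.0}
```

Although "the" is 200 times more frequent than "messi" in the document, its weight drops to zero because it appears in almost every document of the corpus.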

### Retrieving Document
#### Nearest neighbour search 
* Have -> a query article, and a corpus of articles to search;
* In nearest neighbour search
 - Need to specify : Distance metric;
* Output - collection of related articles.

#### 1 - Nearest neighbor
* Input : Query article
* Output : Most similar article
* Algorithm:
 - Search over each article in the corpus
  * Compute s = similarity(query article, corpus article)
  * If s > Best_s, record doc_article = corpus_article
    and set Best_s = s
 - Return doc_article, the most similar article found
  
#### K - Nearest neighbor
* Input : Query article
* Output : List of the k most similar articles.
* Algorithm : same as above, except that instead of tracking only the single best article, a priority queue of the k best articles found so far is maintained and returned.
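
A minimal brute-force sketch of both searches is given below. It assumes the corpus is a dictionary mapping a document id to its vector and that larger similarity scores are better; the function name `k_nearest_neighbors` and the use of `heapq.nlargest` as the priority queue are choices made for this illustration, not the course's implementation.

```python
import heapq


def dot(u, v):
    """Dot-product similarity between two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def k_nearest_neighbors(query, corpus, similarity, k=1):
    """Return the k (score, doc_id) pairs most similar to the query, best first."""
    # Score every article in the corpus against the query (linear scan).
    scored = ((similarity(query, vec), doc_id) for doc_id, vec in corpus.items())
    # nlargest keeps only the k best candidates while scanning.
    return heapq.nlargest(k, scored)


corpus = {
    "soccer_2": [3, 0, 0, 0, 2, 0, 0, 1, 0, 1, 0, 0, 0],
    "africa": [0, 0, 1, 0, 0, 0, 9, 0, 0, 6, 0, 4, 0],
}
query = [1, 0, 0, 0, 5, 3, 0, 0, 1, 0, 0, 0, 0]

print(k_nearest_neighbors(query, corpus, dot, k=1))  # [(13, 'soccer_2')]
print(k_nearest_neighbors(query, corpus, dot, k=2))  # [(13, 'soccer_2'), (0, 'africa')]
```

With `k=1` this reduces to the 1-nearest-neighbor search. If a distance is used instead of a similarity, `heapq.nsmallest` would be the natural counterpart.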

### Clustering models and algorithms

#### One way to find related articles is to search through every article, as seen above.
#### Another way is to structure the documents by topic.
* Discover groups/clusters of related articles, e.g. sports, world news.

#### Usually the articles don't have labels to classify them.

#### If a training set of labelled articles is available (e.g. world news, sports, science, entertainment, technology), this becomes a multiclass classification problem, which is a supervised machine learning problem.

### Clustering :  An unsupervised learning approach
* No labels provided.
* Want to uncover cluster structure.
* Input : documents as vectors (word count vector);  
* Output : cluster labels for each document; the labels are provided post facto;

* Each document gets a cluster label;

### What defines a cluster?
* Cluster is defined by a center and shape/spread.
* Assign observation(document) to the cluster (topic label).
 - Approach 1 : The score under the assigned cluster is higher than under the others.
 - Approach 2 : The observation is more similar to the assigned cluster center than to any other cluster center.
 

### k-means : A clustering Algorithm
* Assume - Similarity = distance to cluster center (smaller is better);

* Step 1 : Choose the number of clusters (k) and initialize the cluster centers.
* Step 2 : Assign each observation to the closest cluster center (this partitions the space into a Voronoi tessellation).
* Step 3 : Revise each cluster center to be the mean of its assigned observations. Since the cluster centers are initialized randomly, they do not necessarily represent the structure of the underlying data, so new cluster centers that fit the data better are derived from the assignments.
* Step 4 : Repeat steps 2 and 3 until convergence.
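
The four steps can be sketched in plain Python as follows. This is a minimal illustration under several assumptions that are not part of the notes: centers are initialized by sampling k data points, distance is squared Euclidean, convergence means the assignments stop changing, and an empty cluster keeps its previous center. The names `k_means`, `squared_distance` and `mean` are hypothetical.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return [sum(coords) / n for coords in zip(*points)]


def k_means(points, k, max_iters=100, seed=0):
    rng = random.Random(seed)
    # Step 1: initialize the centers with k randomly chosen observations.
    centers = [list(p) for p in rng.sample(points, k)]
    assignments = None
    for _ in range(max_iters):
        # Step 2: assign each observation to its closest center.
        new_assignments = [
            min(range(k), key=lambda j: squared_distance(p, centers[j]))
            for p in points
        ]
        # Step 4: stop once the assignments no longer change.
        if new_assignments == assignments:
            break
        assignments = new_assignments
        # Step 3: move each center to the mean of its assigned observations.
        for j in range(k):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:  # an empty cluster keeps its previous center
                centers[j] = mean(members)
    return centers, assignments


points = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
          [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
centers, labels = k_means(points, k=2)
print(labels)   # the first three points share one label, the last three the other
print(centers)
```

Which cluster receives label 0 depends on the random initialization, so only the grouping is meaningful, not the label values themselves.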


### Clustering applications
* Clustering images;
* Grouping patients by medical conditions;
* Product recommendation on Amazon; -> discovering groups of related users;
* structuring web search results;

### Clustering and Similarity ML block diagram
##### Unsupervised

           (document id, 
            document text,                         (TF-IDF)      Clustering   estimated cluster label
            table)                                     X           model         y hat
            Training Data--------> Feature extraction -----------> ML Model ------------------------------->
                                                        |           / \                       |           
                                                        |            |                        |
                                                        |            |                        |
                                                        |   |-------w hat (cluster center)    |
                                                        |   |       / \                       |
                                                        |   |        |                        |
                                                        |   |        |                        |
                                                        |   |     ML Algorithm (k-means)      |
                                                        |   |       / \                       |
                                                        |   |        |                        |
                                                        |   |        |  (distance to cluster  |
                                                        |   |cluster |   center               |
                                                        |   |center  |                        |
                                                        |   |-----> Quality <------------------
                                                        |---------> Metric           
                                                           TF-IDF   

### Document Retrieval

In [1]:
import graphlab

#### Load some text data - from Wikipedia, pages on people

In [2]:
people = graphlab.SFrame('people_wiki.gl/')


In [3]:
people.head()

URI,name,text
<http://dbpedia.org/resource/Digby_Morrell> ...,Digby Morrell,digby morrell born 10 october 1979 is a former ...
<http://dbpedia.org/resource/Alfred_J._Lewy> ...,Alfred J. Lewy,alfred j lewy aka sandy lewy graduated from ...
<http://dbpedia.org/resource/Harpdog_Brown> ...,Harpdog Brown,harpdog brown is a singer and harmonica player who ...
<http://dbpedia.org/resource/Franz_Rottensteiner> ...,Franz Rottensteiner,franz rottensteiner born in waidmannsfeld lower ...
<http://dbpedia.org/resource/G-Enka> ...,G-Enka,henry krvits born 30 december 1974 in tallinn ...
<http://dbpedia.org/resource/Sam_Henderson> ...,Sam Henderson,sam henderson born october 18 1969 is an ...
<http://dbpedia.org/resource/Aaron_LaCrate> ...,Aaron LaCrate,aaron lacrate is an american music producer ...
<http://dbpedia.org/resource/Trevor_Ferguson> ...,Trevor Ferguson,trevor ferguson aka john farrow born 11 november ...
<http://dbpedia.org/resource/Grant_Nelson> ...,Grant Nelson,grant nelson born 27 april 1971 in london ...
<http://dbpedia.org/resource/Cathy_Caruth> ...,Cathy Caruth,cathy caruth born 1955 is frank h t rhodes ...


In [4]:
len(people)

59071

### Explore the dataset - check out the text in it

In [5]:
obama = people[people['name'] == 'Barack Obama']

In [6]:
obama

URI,name,text
<http://dbpedia.org/resource/Barack_Obama> ...,Barack Obama,barack hussein obama ii brk husen bm born august ...


In [7]:
obama['text']

dtype: str
Rows: ?
['barack hussein obama ii brk husen bm born august 4 1961 is the 44th and current president of the united states and the first african american to hold the office born in honolulu hawaii obama is a graduate of columbia university and harvard law school where he served as president of the harvard law review he was a community organizer in chicago before earning his law degree he worked as a civil rights attorney and taught constitutional law at the university of chicago law school from 1992 to 2004 he served three terms representing the 13th district in the illinois senate from 1997 to 2004 running unsuccessfully for the united states house of representatives in 2000in 2004 obama received national attention during his campaign to represent illinois in the united states senate with his victory in the march democratic party primary his keynote address at the democratic national convention in july and his election to the senate in november he began his presidential campa

In [8]:
modi = people[people['name'] == 'Narendra Modi']
modi

URI,name,text
<http://dbpedia.org/resource/Narendra_Modi> ...,Narendra Modi,narendra damodardas modi gujarati nrendr dmodrds ...


In [9]:
modi['text']

dtype: str
Rows: ?
['narendra damodardas modi gujarati nrendr dmodrds modi 13px born 17 september 1950 is the 15th and current prime minister of india in office since may 2014 modi a leader of the bharatiya janata party bjp previously served as the chief minister of gujarat from 2001 to 2014 he is currently the member of parliament mp from varanasimodi led the bjp in the 2014 general election which resulted in an outright majority for the bjp in the lok sabha the lower house of the indian parliament the last time that any party had secured an outright majority in the lok sabha was in 1984 since then modi has also been credited for the bjps electoral victories in the states of haryana and maharashtra in october 2014modi is a hindu nationalist and a member of the rashtriya swayamsevak sangh rss he is a controversial figure both within india as well as internationally as his administration has been criticised for failing to act to prevent the 2002 gujarat riots modi has been praised for h

### Manual data model

### Get word count for the Obama article

In [10]:
obama['word_count'] = graphlab.text_analytics.count_words(obama['text'])

In [11]:
print obama['word_count']

[{'operations': 1L, 'represent': 1L, 'office': 2L, 'unemployment': 1L, 'doddfrank': 1L, 'over': 1L, 'unconstitutional': 1L, 'domestic': 2L, 'major': 1L, 'years': 1L, 'against': 1L, 'proposition': 1L, 'seats': 1L, 'graduate': 1L, 'debate': 1L, 'before': 1L, 'death': 1L, '20': 2L, 'taxpayer': 1L, 'representing': 1L, 'obamacare': 1L, 'barack': 1L, 'to': 14L, '4': 1L, 'policy': 2L, '8': 1L, 'he': 7L, '2011': 3L, '2010': 2L, '2013': 1L, '2012': 1L, 'bin': 1L, 'then': 1L, 'his': 11L, 'march': 1L, 'gains': 1L, 'cuba': 1L, 'school': 3L, '1992': 1L, 'new': 1L, 'not': 1L, 'during': 2L, 'ending': 1L, 'continued': 1L, 'presidential': 2L, 'states': 3L, 'husen': 1L, 'osama': 1L, 'californias': 1L, 'equality': 1L, 'prize': 1L, 'lost': 1L, 'made': 1L, 'inaugurated': 1L, 'january': 3L, 'university': 2L, 'rights': 1L, 'july': 1L, 'gun': 1L, 'stimulus': 1L, 'rodham': 1L, 'troop': 1L, 'withdrawal': 1L, 'brk': 1L, 'nine': 1L, 'where': 1L, 'referred': 1L, 'affordable': 1L, 'attorney': 1L, 'on': 2L, 'often':

### Data engineering - Sort word count 

In [12]:
obama_word_count_table = obama[['word_count']].stack('word_count', new_column_name=['word', 'count'])  # stack: unpack the dictionary into one row per (word, count) pair

In [13]:
obama_word_count_table.head()

word,count
normalize,1
sought,1
combat,1
continued,1
unconstitutional,1
8,1
californias,1
1996,1
marriage,1
defense,1


In [14]:
obama_word_count_table.sort('count', ascending=False)

word,count
the,40
in,30
and,21
of,18
to,14
his,11
obama,9
act,8
a,7
he,7


### Computing TF-IDF
#### It must be computed for the entire corpus

In [15]:
people['word_count'] = graphlab.text_analytics.count_words(people['text'])
people.head()

URI,name,text,word_count
<http://dbpedia.org/resource/Digby_Morrell> ...,Digby Morrell,digby morrell born 10 october 1979 is a former ...,"{'since': 1L, 'carltons': 1L, 'being': 1L, '2005': ..."
<http://dbpedia.org/resource/Alfred_J._Lewy> ...,Alfred J. Lewy,alfred j lewy aka sandy lewy graduated from ...,"{'precise': 1L, 'thomas': 1L, 'closely': 1L, ..."
<http://dbpedia.org/resource/Harpdog_Brown> ...,Harpdog Brown,harpdog brown is a singer and harmonica player who ...,"{'just': 1L, 'issued': 1L, 'mainly': 1L, ..."
<http://dbpedia.org/resource/Franz_Rottensteiner> ...,Franz Rottensteiner,franz rottensteiner born in waidmannsfeld lower ...,"{'all': 1L, 'bauforschung': 1L, ..."
<http://dbpedia.org/resource/G-Enka> ...,G-Enka,henry krvits born 30 december 1974 in tallinn ...,"{'legendary': 1L, 'gangstergenka': 1L, ..."
<http://dbpedia.org/resource/Sam_Henderson> ...,Sam Henderson,sam henderson born october 18 1969 is an ...,"{'now': 1L, 'currently': 1L, 'less': 1L, 'being': ..."
<http://dbpedia.org/resource/Aaron_LaCrate> ...,Aaron LaCrate,aaron lacrate is an american music producer ...,"{'exclusive': 2L, 'producer': 1L, 'tribe': ..."
<http://dbpedia.org/resource/Trevor_Ferguson> ...,Trevor Ferguson,trevor ferguson aka john farrow born 11 november ...,"{'taxi': 1L, 'salon': 1L, 'gangs': 1L, 'being': ..."
<http://dbpedia.org/resource/Grant_Nelson> ...,Grant Nelson,grant nelson born 27 april 1971 in london ...,"{'houston': 1L, 'frankie': 1L, 'labels': ..."
<http://dbpedia.org/resource/Cathy_Caruth> ...,Cathy Caruth,cathy caruth born 1955 is frank h t rhodes ...,"{'phenomenon': 1L, 'deborash': 1L, ..."


In [16]:
tfidf = graphlab.text_analytics.tf_idf(people['word_count'])
tfidf

dtype: dict
Rows: 59071
[{'since': 1.455376717308041, 'carltons': 7.0744723837970485, 'being': 1.7938099524877322, '2005': 1.6425861253275964, '2008': 1.5093391374786154, 'coach': 5.444264118987054, 'its': 1.6875948402695313, 'before': 2.9935647453367427, 'australia': 2.86858644684204, '21': 2.797250863489293, 'northern': 3.310021742836038, 'bullants': 7.489987827758714, 'to': 0.23472468840899613, 'perth': 5.051601193605607, 'sydney': 3.5981675296480873, 'selection': 3.836578553093086, '2014': 2.2073995783446634, 'has': 0.428497539744039, '2011': 1.7023470901042919, '2013': 1.9545642372230505, 'division': 2.7906099979103978, 'his': 0.7878343656409719, 'was': 0.3968289280609173, 'rules': 3.8272034844276295, 'assistant': 2.5220702633476124, 'spanned': 5.531174273867493, 'early': 1.929422753652229, 'game': 2.4168995190159084, 'five': 2.2137301792754096, 'during': 1.3174651479035495, 'continued': 2.720588055069447, '44game': 9.887883100557085, 'cause': 4.8023464982877115, 'twice': 3.330158

In [17]:
tfidf.head()

dtype: dict
Rows: 10
[{'since': 1.455376717308041, 'carltons': 7.0744723837970485, 'being': 1.7938099524877322, '2005': 1.6425861253275964, '2008': 1.5093391374786154, 'coach': 5.444264118987054, 'its': 1.6875948402695313, 'before': 2.9935647453367427, 'australia': 2.86858644684204, '21': 2.797250863489293, 'northern': 3.310021742836038, 'bullants': 7.489987827758714, 'to': 0.23472468840899613, 'perth': 5.051601193605607, 'sydney': 3.5981675296480873, 'selection': 3.836578553093086, '2014': 2.2073995783446634, 'has': 0.428497539744039, '2011': 1.7023470901042919, '2013': 1.9545642372230505, 'division': 2.7906099979103978, 'his': 0.7878343656409719, 'was': 0.3968289280609173, 'rules': 3.8272034844276295, 'assistant': 2.5220702633476124, 'spanned': 5.531174273867493, 'early': 1.929422753652229, 'game': 2.4168995190159084, 'five': 2.2137301792754096, 'during': 1.3174651479035495, 'continued': 2.720588055069447, '44game': 9.887883100557085, 'cause': 4.8023464982877115, 'twice': 3.330158222

In [18]:
people['tfidf'] = tfidf

### Examine tf-idf for Obama article

In [19]:
obama = people[people['name'] == 'Barack Obama' ]

In [20]:
obama.head()

URI,name,text,word_count
<http://dbpedia.org/resource/Barack_Obama> ...,Barack Obama,barack hussein obama ii brk husen bm born august ...,"{'operations': 1L, 'represent': 1L, ..."

tfidf
"{'operations': 3.811771079388818, ..."


In [21]:
obama[['tfidf']].stack('tfidf', new_column_name=['word', 'tfidf']).sort('tfidf', ascending=False)

word,tfidf
obama,43.2956530721
act,27.678222623
iraq,17.747378588
control,14.8870608452
law,14.7229357618
ordered,14.5333739509
military,13.1159327785
involvement,12.7843852412
response,12.7843852412
democratic,12.4106886973


### Computing the distance between Wikipedia pages

In [22]:
clinton = people[people['name'] == 'Bill Clinton']

In [23]:
beckham = people[people['name'] == 'David Beckham']

### Compute the similarity of the chosen people (Clinton, Beckham) with respect to Obama.

In [24]:
# The lower the cosine distance, the more similar the two articles.
graphlab.distances.cosine(obama['tfidf'][0], clinton['tfidf'][0])

0.8339854936884276

In [25]:
graphlab.distances.cosine(obama['tfidf'][0], beckham['tfidf'][0])

0.9791305844747478

##### Since the cosine distance from Obama to Clinton (0.834) is lower than the distance from Obama to Beckham (0.979), Obama's article is more similar to Clinton's.
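
To make the quantity being compared concrete, here is a plain Python 3 sketch of cosine distance on sparse `{word: weight}` dictionaries. It assumes the usual definition, 1 minus the cosine similarity; that `graphlab.distances.cosine` follows exactly this definition is an assumption here, and the function name `cosine_distance` and the toy documents are hypothetical.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity for sparse vectors stored as {word: weight} dicts."""
    # Only words present in both documents contribute to the dot product.
    dot = sum(w * b[word] for word, w in a.items() if word in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for an all-zero vector")
    return 1.0 - dot / (norm_a * norm_b)


doc_1 = {"obama": 3.0, "law": 1.0}
doc_2 = {"obama": 6.0, "law": 2.0}  # same direction as doc_1, twice as long
doc_3 = {"football": 4.0}           # shares no words with doc_1

print(abs(cosine_distance(doc_1, doc_2)) < 1e-9)  # True: distance is (numerically) zero
print(cosine_distance(doc_1, doc_3))              # 1.0
```

Because the dot product is divided by both norms, the length normalization discussed earlier is built into this distance.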

### Build a nearest neighbour model for document retrieval

In [26]:
knn_model = graphlab.nearest_neighbors.create(people, features=['tfidf'],label='name')

In [27]:
knn_model.query(obama)

query_label,reference_label,distance,rank
0,Barack Obama,0.0,1
0,Joe Biden,0.794117647059,2
0,Joe Lieberman,0.794685990338,3
0,Kelly Ayotte,0.811989100817,4
0,Bill Clinton,0.813852813853,5


In [28]:
modi = people[people['name'] == 'Narendra Modi']

In [29]:
knn_model.query(modi)

query_label,reference_label,distance,rank
0,Narendra Modi,0.0,1
0,Anupriya Patel,0.784210526316,2
0,Ram Jethmalani,0.8,3
0,Anant Geete,0.810526315789,4
0,Lakshman Singh (politician) ...,0.815533980583,5


### Other examples

In [30]:
swift = people[people['name'] == 'Taylor Swift']

In [31]:
knn_model.query(swift)

query_label,reference_label,distance,rank
0,Taylor Swift,0.0,1
0,Carrie Underwood,0.76231884058,2
0,Alicia Keys,0.764705882353,3
0,Jordin Sparks,0.769633507853,4
0,Leona Lewis,0.776119402985,5


In [32]:
jolie = people[people['name'] == 'Angelina Jolie']

In [33]:
knn_model.query(jolie)

query_label,reference_label,distance,rank
0,Angelina Jolie,0.0,1
0,Brad Pitt,0.784023668639,2
0,Julianne Moore,0.795857988166,3
0,Billy Bob Thornton,0.803069053708,4
0,George Clooney,0.8046875,5


In [34]:
arnold = people[people['name'] == 'Arnold Schwarzenegger']

In [35]:
knn_model.query(arnold)

query_label,reference_label,distance,rank
0,Arnold Schwarzenegger,0.0,1
0,Jesse Ventura,0.818918918919,2
0,John Kitzhaber,0.824615384615,3
0,Lincoln Chafee,0.833876221498,4
0,Anthony Foxx,0.833910034602,5


In [36]:
mark = people[people['name'] == 'Mark Zuckerberg']

In [37]:
mark

URI,name,text,word_count
<http://dbpedia.org/resource/Mark_Zuckerberg> ...,Mark Zuckerberg,mark elliot zuckerberg born may 14 1984 is an ...,"{'moskovitz': 1L, 'afterwards': 1L, 'fr ..."

tfidf
"{'moskovitz': 10.986495389225194, ..."


In [38]:
knn_model.query(mark)

query_label,reference_label,distance,rank
0,Mark Zuckerberg,0.0,1
0,Roya Mahboob,0.817796610169,2
0,Nicholas A. Christakis,0.820627802691,3
0,David Karp,0.822134387352,4
0,David Falk,0.826923076923,5


### Week 4 - Assignment

### Scenario 1 : Comparing top words according to word counts to TF-IDF

In [39]:
people.head()

URI,name,text,word_count
<http://dbpedia.org/resource/Digby_Morrell> ...,Digby Morrell,digby morrell born 10 october 1979 is a former ...,"{'since': 1L, 'carltons': 1L, 'being': 1L, '2005': ..."
<http://dbpedia.org/resource/Alfred_J._Lewy> ...,Alfred J. Lewy,alfred j lewy aka sandy lewy graduated from ...,"{'precise': 1L, 'thomas': 1L, 'closely': 1L, ..."
<http://dbpedia.org/resource/Harpdog_Brown> ...,Harpdog Brown,harpdog brown is a singer and harmonica player who ...,"{'just': 1L, 'issued': 1L, 'mainly': 1L, ..."
<http://dbpedia.org/resource/Franz_Rottensteiner> ...,Franz Rottensteiner,franz rottensteiner born in waidmannsfeld lower ...,"{'all': 1L, 'bauforschung': 1L, ..."
<http://dbpedia.org/resource/G-Enka> ...,G-Enka,henry krvits born 30 december 1974 in tallinn ...,"{'legendary': 1L, 'gangstergenka': 1L, ..."
<http://dbpedia.org/resource/Sam_Henderson> ...,Sam Henderson,sam henderson born october 18 1969 is an ...,"{'now': 1L, 'currently': 1L, 'less': 1L, 'being': ..."
<http://dbpedia.org/resource/Aaron_LaCrate> ...,Aaron LaCrate,aaron lacrate is an american music producer ...,"{'exclusive': 2L, 'producer': 1L, 'tribe': ..."
<http://dbpedia.org/resource/Trevor_Ferguson> ...,Trevor Ferguson,trevor ferguson aka john farrow born 11 november ...,"{'taxi': 1L, 'salon': 1L, 'gangs': 1L, 'being': ..."
<http://dbpedia.org/resource/Grant_Nelson> ...,Grant Nelson,grant nelson born 27 april 1971 in london ...,"{'houston': 1L, 'frankie': 1L, 'labels': ..."
<http://dbpedia.org/resource/Cathy_Caruth> ...,Cathy Caruth,cathy caruth born 1955 is frank h t rhodes ...,"{'phenomenon': 1L, 'deborash': 1L, ..."

tfidf
"{'since': 1.455376717308041, ..."
"{'precise': 6.44320060695519, ..."
"{'just': 2.7007299687108643, ..."
"{'all': 1.6431112434912472, ..."
"{'legendary': 4.280856294365192, ..."
"{'now': 1.96695239252401, 'currently': ..."
"{'exclusive': 10.455187230695827, ..."
"{'taxi': 6.0520214560945025, ..."
"{'houston': 3.935505942157149, ..."
"{'phenomenon': 5.750053426395245, ..."


In [53]:
Individuals = graphlab.SFrame('people_wiki.gl/')

In [54]:
Individuals

URI,name,text
<http://dbpedia.org/resource/Digby_Morrell> ...,Digby Morrell,digby morrell born 10 october 1979 is a former ...
<http://dbpedia.org/resource/Alfred_J._Lewy> ...,Alfred J. Lewy,alfred j lewy aka sandy lewy graduated from ...
<http://dbpedia.org/resource/Harpdog_Brown> ...,Harpdog Brown,harpdog brown is a singer and harmonica player who ...
<http://dbpedia.org/resource/Franz_Rottensteiner> ...,Franz Rottensteiner,franz rottensteiner born in waidmannsfeld lower ...
<http://dbpedia.org/resource/G-Enka> ...,G-Enka,henry krvits born 30 december 1974 in tallinn ...
<http://dbpedia.org/resource/Sam_Henderson> ...,Sam Henderson,sam henderson born october 18 1969 is an ...
<http://dbpedia.org/resource/Aaron_LaCrate> ...,Aaron LaCrate,aaron lacrate is an american music producer ...
<http://dbpedia.org/resource/Trevor_Ferguson> ...,Trevor Ferguson,trevor ferguson aka john farrow born 11 november ...
<http://dbpedia.org/resource/Grant_Nelson> ...,Grant Nelson,grant nelson born 27 april 1971 in london ...
<http://dbpedia.org/resource/Cathy_Caruth> ...,Cathy Caruth,cathy caruth born 1955 is frank h t rhodes ...


In [55]:
elton = Individuals[Individuals['name'] == 'Elton John']

In [56]:
elton

URI,name,text
<http://dbpedia.org/resource/Elton_John> ...,Elton John,sir elton hercules john cbe born reginald ken ...


In [57]:
elton['word_count'] = graphlab.text_analytics.count_words(elton['text'])

In [58]:
print elton

+-------------------------------+------------+-------------------------------+
|              URI              |    name    |              text             |
+-------------------------------+------------+-------------------------------+
| <http://dbpedia.org/resour... | Elton John | sir elton hercules john cb... |
+-------------------------------+------------+-------------------------------+
+-------------------------------+
|           word_count          |
+-------------------------------+
| {'all': 1L, 'six': 1L, 'pr... |
+-------------------------------+
[1 rows x 4 columns]



In [60]:
# Stack the word_count dictionary into one row per (word, count) pair, as was done for the Obama article.
elton_word_count_table = elton[['word_count']].stack('word_count', new_column_name=['word', 'count'])

In [61]:
elton_word_count_table = elton_word_count_table.sort('count', ascending=False)

In [62]:
elton_word_count_table.head()

word,count
the,27
in,18
and,15
of,13
a,10
has,9
he,7
john,7
on,6
since,5


### Q1. Top words by word count for Elton John
### (the, in, and)

### TF-IDF is computed on the whole data-set

In [63]:
Individuals['word_count'] = graphlab.text_analytics.count_words(Individuals['text'])

In [64]:
Individuals.head()

URI,name,text,word_count
<http://dbpedia.org/resource/Digby_Morrell> ...,Digby Morrell,digby morrell born 10 october 1979 is a former ...,"{'since': 1L, 'carltons': 1L, 'being': 1L, '2005': ..."
<http://dbpedia.org/resource/Alfred_J._Lewy> ...,Alfred J. Lewy,alfred j lewy aka sandy lewy graduated from ...,"{'precise': 1L, 'thomas': 1L, 'closely': 1L, ..."
<http://dbpedia.org/resource/Harpdog_Brown> ...,Harpdog Brown,harpdog brown is a singer and harmonica player who ...,"{'just': 1L, 'issued': 1L, 'mainly': 1L, ..."
<http://dbpedia.org/resource/Franz_Rottensteiner> ...,Franz Rottensteiner,franz rottensteiner born in waidmannsfeld lower ...,"{'all': 1L, 'bauforschung': 1L, ..."
<http://dbpedia.org/resource/G-Enka> ...,G-Enka,henry krvits born 30 december 1974 in tallinn ...,"{'legendary': 1L, 'gangstergenka': 1L, ..."
<http://dbpedia.org/resource/Sam_Henderson> ...,Sam Henderson,sam henderson born october 18 1969 is an ...,"{'now': 1L, 'currently': 1L, 'less': 1L, 'being': ..."
<http://dbpedia.org/resource/Aaron_LaCrate> ...,Aaron LaCrate,aaron lacrate is an american music producer ...,"{'exclusive': 2L, 'producer': 1L, 'tribe': ..."
<http://dbpedia.org/resource/Trevor_Ferguson> ...,Trevor Ferguson,trevor ferguson aka john farrow born 11 november ...,"{'taxi': 1L, 'salon': 1L, 'gangs': 1L, 'being': ..."
<http://dbpedia.org/resource/Grant_Nelson> ...,Grant Nelson,grant nelson born 27 april 1971 in london ...,"{'houston': 1L, 'frankie': 1L, 'labels': ..."
<http://dbpedia.org/resource/Cathy_Caruth> ...,Cathy Caruth,cathy caruth born 1955 is frank h t rhodes ...,"{'phenomenon': 1L, 'deborash': 1L, ..."


In [65]:
tfidfcomputed = graphlab.text_analytics.tf_idf(Individuals['word_count'])

In [66]:
tfidfcomputed

dtype: dict
Rows: 59071
[{'since': 1.455376717308041, 'carltons': 7.0744723837970485, 'being': 1.7938099524877322, '2005': 1.6425861253275964, '2008': 1.5093391374786154, 'coach': 5.444264118987054, 'its': 1.6875948402695313, 'before': 2.9935647453367427, 'australia': 2.86858644684204, '21': 2.797250863489293, 'northern': 3.310021742836038, 'bullants': 7.489987827758714, 'to': 0.23472468840899613, 'perth': 5.051601193605607, 'sydney': 3.5981675296480873, 'selection': 3.836578553093086, '2014': 2.2073995783446634, 'has': 0.428497539744039, '2011': 1.7023470901042919, '2013': 1.9545642372230505, 'division': 2.7906099979103978, 'his': 0.7878343656409719, 'was': 0.3968289280609173, 'rules': 3.8272034844276295, 'assistant': 2.5220702633476124, 'spanned': 5.531174273867493, 'early': 1.929422753652229, 'game': 2.4168995190159084, 'five': 2.2137301792754096, 'during': 1.3174651479035495, 'continued': 2.720588055069447, '44game': 9.887883100557085, 'cause': 4.8023464982877115, 'twice': 3.330158

In [67]:
Individuals['tfidf'] = tfidfcomputed

In [68]:
Individuals.head()

URI,name,text,word_count
<http://dbpedia.org/resource/Digby_Morrell> ...,Digby Morrell,digby morrell born 10 october 1979 is a former ...,"{'since': 1L, 'carltons': 1L, 'being': 1L, '2005': ..."
<http://dbpedia.org/resource/Alfred_J._Lewy> ...,Alfred J. Lewy,alfred j lewy aka sandy lewy graduated from ...,"{'precise': 1L, 'thomas': 1L, 'closely': 1L, ..."
<http://dbpedia.org/resource/Harpdog_Brown> ...,Harpdog Brown,harpdog brown is a singer and harmonica player who ...,"{'just': 1L, 'issued': 1L, 'mainly': 1L, ..."
<http://dbpedia.org/resource/Franz_Rottensteiner> ...,Franz Rottensteiner,franz rottensteiner born in waidmannsfeld lower ...,"{'all': 1L, 'bauforschung': 1L, ..."
<http://dbpedia.org/resource/G-Enka> ...,G-Enka,henry krvits born 30 december 1974 in tallinn ...,"{'legendary': 1L, 'gangstergenka': 1L, ..."
<http://dbpedia.org/resource/Sam_Henderson> ...,Sam Henderson,sam henderson born october 18 1969 is an ...,"{'now': 1L, 'currently': 1L, 'less': 1L, 'being': ..."
<http://dbpedia.org/resource/Aaron_LaCrate> ...,Aaron LaCrate,aaron lacrate is an american music producer ...,"{'exclusive': 2L, 'producer': 1L, 'tribe': ..."
<http://dbpedia.org/resource/Trevor_Ferguson> ...,Trevor Ferguson,trevor ferguson aka john farrow born 11 november ...,"{'taxi': 1L, 'salon': 1L, 'gangs': 1L, 'being': ..."
<http://dbpedia.org/resource/Grant_Nelson> ...,Grant Nelson,grant nelson born 27 april 1971 in london ...,"{'houston': 1L, 'frankie': 1L, 'labels': ..."
<http://dbpedia.org/resource/Cathy_Caruth> ...,Cathy Caruth,cathy caruth born 1955 is frank h t rhodes ...,"{'phenomenon': 1L, 'deborash': 1L, ..."

tfidf
"{'since': 1.455376717308041, ..."
"{'precise': 6.44320060695519, ..."
"{'just': 2.7007299687108643, ..."
"{'all': 1.6431112434912472, ..."
"{'legendary': 4.280856294365192, ..."
"{'now': 1.96695239252401, 'currently': ..."
"{'exclusive': 10.455187230695827, ..."
"{'taxi': 6.0520214560945025, ..."
"{'houston': 3.935505942157149, ..."
"{'phenomenon': 5.750053426395245, ..."


In [72]:
elton = Individuals[Individuals['name'] == 'Elton John']
elton

URI,name,text,word_count
<http://dbpedia.org/resource/Elton_John> ...,Elton John,sir elton hercules john cbe born reginald ken ...,"{'all': 1L, 'six': 1L, 'producer': 1L, ..."

tfidf
"{'all': 1.6431112434912472, ..."


In [73]:
# obama[['tfidf']].stack('tfidf', new_column_name=['word', 'tfidf']).sort('tfidf', ascending=False)
elton[['tfidf']].stack('tfidf', new_column_name=['word', 'tfidf']).sort('tfidf', ascending=False)

word,tfidf
furnish,18.38947184
elton,17.48232027
billboard,17.3036809575
john,13.9393127924
songwriters,11.250406447
overallelton,10.9864953892
tonightcandle,10.9864953892
19702000,10.2933482087
fivedecade,10.2933482087
aids,10.262846934
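
The `stack` and `sort` calls above turn the TF-IDF dictionary into one row per word and order the rows by weight. A rough plain-Python equivalent is sketched below; the helper name `top_words` is hypothetical, and the sample dictionary holds only a handful of the values displayed in the output above.

```python
def top_words(weights, k=3):
    """Return the k (word, weight) pairs with the largest weights, highest first."""
    return sorted(weights.items(), key=lambda pair: pair[1], reverse=True)[:k]


# A few entries copied from the Elton John output above, deliberately unordered.
elton_tfidf_sample = {
    'aids': 10.262846934,
    'john': 13.9393127924,
    'furnish': 18.38947184,
    'billboard': 17.3036809575,
    'elton': 17.48232027,
    'songwriters': 11.250406447,
}

top3 = top_words(elton_tfidf_sample, k=3)
```

On this sample, `top3` lists `furnish`, `elton` and `billboard` in that order, matching the first three rows of the table.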


### Q2. Top TF-IDF words for Elton John
### (furnish, elton, billboard)

### Scenario 2 : Measuring distance

In [74]:
victoria = Individuals[Individuals['name'] == 'Victoria Beckham']

In [75]:
paul = Individuals[Individuals['name'] == 'Paul McCartney']

In [76]:
victoria

URI,name,text,word_count
<http://dbpedia.org/resource/Victoria_Beckham> ...,Victoria Beckham,victoria caroline beckham ne adams born 17 april ...,"{'millionin': 1L, 'saying': 1L, 'cameo': ..."

tfidf
"{'millionin': 7.728398851203712, ..."


In [77]:
paul

URI,name,text,word_count
<http://dbpedia.org/resource/Paul_McCartney> ...,Paul McCartney,sir james paul mccartney mbe born 18 june 1942 is ...,"{'all': 1L, 'gold': 1L, 'over': 1L, 'kintyre': ..."

tfidf
"{'all': 1.6431112434912472, ..."


In [78]:
# The lower the cosine distance, the more similar the two articles.
#graphlab.distances.cosine(obama['tfidf'][0], clinton['tfidf'][0])
graphlab.distances.cosine(elton['tfidf'][0], victoria['tfidf'][0])

0.9567006376655429

In [79]:
graphlab.distances.cosine(elton['tfidf'][0], paul['tfidf'][0])

0.8250310029221779
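
Cosine distance is one minus the dot product of the two vectors after each has been divided by its norm, which ties back to the normalization step described earlier. The sketch below is a minimal plain-Python version for sparse `{word: weight}` dictionaries; it is an illustration under that definition, not GraphLab's implementation, and the toy vectors are made up.

```python
import math


def cosine_distance(a, b):
    """Cosine distance between two sparse vectors stored as {word: weight} dicts."""
    # Only words present in both dicts contribute to the dot product.
    dot = sum(weight * b[word] for word, weight in a.items() if word in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for an all-zero vector")
    return 1.0 - dot / (norm_a * norm_b)


# Toy vectors (hypothetical weights).
pianist = {'piano': 3.0, 'album': 4.0}
same_topic = {'album': 4.0, 'piano': 3.0}
footballer = {'football': 2.0}

d_same = cosine_distance(pianist, same_topic)   # identical direction
d_diff = cosine_distance(pianist, footballer)   # no shared words
```

Vectors pointing in the same direction have distance 0, and vectors with no words in common have distance 1, so the 0.956 for Victoria Beckham indicates very little weighted overlap with Elton John's article.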

### Q3. The cosine distance between 'Elton John's and 'Victoria Beckham's articles (represented with TF-IDF) falls within which range?
### 0.9 to 1

### Q4. The cosine distance between 'Elton John's and 'Paul McCartney's articles (represented with TF-IDF) falls within which range?
### 0.7 to 0.89

### Q5. Who is closer to 'Elton John', 'Victoria Beckham' or 'Paul McCartney'?
### Paul McCartney

### Scenario 3 : k-NN models to find nearest neighbors using word_count and tfidf features.

In [80]:
knn_model_individuals = graphlab.nearest_neighbors.create(Individuals, features=['word_count'], label='name')

In [81]:
knn_model_individuals.query(elton)

query_label,reference_label,distance,rank
0,Elton John,0.0,1
0,Phil Collins,0.76399026764,2
0,Rod Stewart,0.773333333333,3
0,Annie Lennox,0.776623376623,4
0,Barry Gibb,0.780952380952,5
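
Conceptually, a nearest-neighbor query computes the distance from the query article to every article in the corpus and keeps the k smallest. The brute-force sketch below shows that idea in plain Python, assuming cosine distance; the names `k_nearest` and `cosine_distance` and the toy corpus are hypothetical, and GraphLab's own model may use a different distance and a faster search structure.

```python
import heapq
import math


def cosine_distance(a, b):
    """Cosine distance between two non-empty {word: weight} dicts."""
    dot = sum(weight * b[word] for word, weight in a.items() if word in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return 1.0 - dot / (norm_a * norm_b)


def k_nearest(query, corpus, k=5, distance=cosine_distance):
    """Return the k (name, distance) pairs closest to `query`, nearest first."""
    scored = ((distance(query, vec), name) for name, vec in corpus.items())
    return [(name, dist) for dist, name in heapq.nsmallest(k, scored)]


# Toy corpus of word-count dicts (made-up data).
corpus = {
    'pianist': {'piano': 2, 'album': 1},
    'singer': {'album': 2, 'tour': 1},
    'footballer': {'goal': 3, 'club': 1},
}

neighbors = k_nearest(corpus['pianist'], corpus, k=2)
```

As in the query output above, the query article itself comes back at rank 1 with distance (approximately) zero, so the first real neighbor is at rank 2.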


In [87]:
billy = Individuals[Individuals['name'] == 'Billy Joel']
cliff = Individuals[Individuals['name'] == 'Cliff Richard']
roger = Individuals[Individuals['name'] == 'Roger Daltrey']
bush = Individuals[Individuals['name'] == 'George Bush']
billy

URI,name,text,word_count
<http://dbpedia.org/resource/Billy_Joel> ...,Billy Joel,william martin billy joel born may 9 1949 is an ...,"{'all': 4L, 'singersongwriter': 1L, ..."

tfidf
"{'all': 6.572444973964989, ..."


In [88]:
cliff

URI,name,text,word_count
<http://dbpedia.org/resource/Cliff_Richard> ...,Cliff Richard,sir cliff richard obe born harry rodger web ...,"{'all': 1L, 'six': 1L, 'softening': 1L, 'over': ..."

tfidf
"{'all': 1.6431112434912472, ..."


In [90]:
roger

URI,name,text,word_count
<http://dbpedia.org/resource/Roger_Daltrey> ...,Roger Daltrey,roger harry daltrey cbe born 1 march 1944 is an ...,"{'all': 2L, 'producer': 1L, 'gods': 1L, 'over': ..."

tfidf
"{'all': 3.2862224869824943, ..."


In [91]:
bush

URI,name,text,word_count,tfidf


### Q6. Who is the nearest neighbor to 'Elton John' using raw word counts?

In [94]:
print graphlab.distances.cosine(elton['word_count'][0], billy['word_count'][0])
print graphlab.distances.cosine(elton['word_count'][0], cliff['word_count'][0])
print graphlab.distances.cosine(elton['word_count'][0], roger['word_count'][0])
# print graphlab.distances.cosine(elton['word_count'][0],  bush['word_count'][0])  # raises an error: `bush` is empty (no article is named 'George Bush'), so there is no row 0

0.222217647818
0.16142415259
0.177554184666


### Cliff Richard
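
Picking the answer from a shortlist amounts to taking the smallest of the printed distances and ignoring any candidate with no matching article. The sketch below does this with the three values printed above; the helper name `rank_candidates` is hypothetical, and `None` is used here as an assumed marker for a candidate whose lookup returned no rows.

```python
def rank_candidates(distances):
    """Sort {name: distance} from nearest to farthest, skipping entries set to None."""
    found = {name: d for name, d in distances.items() if d is not None}
    return sorted(found.items(), key=lambda pair: pair[1])


# Cosine distances to Elton John on raw word counts, copied from the output above.
q6_distances = {
    'Billy Joel': 0.222217647818,
    'Cliff Richard': 0.16142415259,
    'Roger Daltrey': 0.177554184666,
    'George Bush': None,  # no matching article, so no distance could be computed
}

ranking = rank_candidates(q6_distances)
nearest_name = ranking[0][0]
```

Here `nearest_name` is `'Cliff Richard'`, followed by Roger Daltrey and then Billy Joel.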

In [95]:
knn_model_tfidf = graphlab.nearest_neighbors.create(Individuals, features=['tfidf'],label='name')

In [96]:
knn_model_tfidf.query(elton)

query_label,reference_label,distance,rank
0,Elton John,0.0,1
0,Phil Collins,0.76399026764,2
0,Rod Stewart,0.773333333333,3
0,Annie Lennox,0.776623376623,4
0,Barry Gibb,0.780952380952,5
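
The query result for the TF-IDF model is identical to the one for the word-count model. One plausible explanation, offered as an assumption rather than something these notes confirm, is that no `distance` was passed to `create` and the distance that was used depends only on which words are present, not on their weights. Jaccard distance on the sets of words behaves this way, and the displayed values are simple fractions (0.773333333333 equals 58/75), which is what a ratio of set sizes would produce. The sketch below, with a hypothetical helper `jaccard_distance` and made-up toy data, shows why such a distance cannot tell the two representations apart.

```python
def jaccard_distance(a, b):
    """Jaccard distance between the key sets of two {word: weight} dicts."""
    keys_a, keys_b = set(a), set(b)
    union = keys_a | keys_b
    if not union:
        return 0.0  # two empty documents are treated as identical
    return 1.0 - float(len(keys_a & keys_b)) / len(union)


# The same two toy documents, once as raw counts and once with made-up TF-IDF weights.
counts_a = {'piano': 3, 'album': 1}
counts_b = {'album': 2, 'tour': 1}
tfidf_a = {'piano': 9.7, 'album': 1.2}
tfidf_b = {'album': 2.4, 'tour': 5.1}

d_counts = jaccard_distance(counts_a, counts_b)
d_tfidf = jaccard_distance(tfidf_a, tfidf_b)
```

Both distances are 1 - 1/3, because the words are the same in both representations and only the weights differ. Getting rankings that reflect the TF-IDF weights requires a weight-sensitive distance such as cosine, which is what the explicit `graphlab.distances.cosine` calls in these notes compute.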


### Q7. Who is the nearest neighbor to 'Elton John' using TF-IDF?
### Rod Stewart

### Q8. Who is the nearest neighbor to 'Victoria Beckham' using raw word counts?

In [97]:
knn_model_individuals.query(victoria)

query_label,reference_label,distance,rank
0,Victoria Beckham,0.0,1
0,Cheryl Cole,0.800586510264,2
0,Heidi Klum,0.810344827586,3
0,Simon Fuller,0.822742474916,4
0,Adele,0.824915824916,5


In [101]:
dow = Individuals[Individuals['name'] == 'Stephen Dow Beckham']
louis = Individuals[Individuals['name'] == 'Louis Molloy']
corri = Individuals[Individuals['name'] == 'Adrienne Corri']
mary = Individuals[Individuals['name'] == 'Mary Fitzgerald (artist)']

In [104]:
print graphlab.distances.cosine(victoria['word_count'][0], dow['word_count'][0])
print graphlab.distances.cosine(victoria['word_count'][0], louis['word_count'][0])
print graphlab.distances.cosine(victoria['word_count'][0], corri['word_count'][0])
print graphlab.distances.cosine(victoria['word_count'][0],  mary['word_count'][0])

0.31219127606
0.310507512959
0.214509782788
0.207307036115


### Mary Fitzgerald (artist)

### Q9. Who is the nearest neighbor to 'Victoria Beckham' using TF-IDF?

In [105]:
knn_model_tfidf.query(victoria)

query_label,reference_label,distance,rank
0,Victoria Beckham,0.0,1
0,Cheryl Cole,0.800586510264,2
0,Heidi Klum,0.810344827586,3
0,Simon Fuller,0.822742474916,4
0,Adele,0.824915824916,5


In [106]:
mel = Individuals[Individuals['name'] == 'Mel B']
rush = Individuals[Individuals['name'] == 'Caroline Rush']
david = Individuals[Individuals['name'] == 'David Beckham']
carrie = Individuals[Individuals['name'] == 'Carrie Reichardt']

In [107]:
print graphlab.distances.cosine(victoria['tfidf'][0], mel['tfidf'][0])
print graphlab.distances.cosine(victoria['tfidf'][0], rush['tfidf'][0])
print graphlab.distances.cosine(victoria['tfidf'][0], david['tfidf'][0])
print graphlab.distances.cosine(victoria['tfidf'][0],  carrie['tfidf'][0])

0.809585523409
0.819826422919
0.548169610263
0.975395042344


### David Beckham