# Document retrieval from Wikipedia data

## Imports

In [1]:
import pandas as pd
import numpy as np
import re
import string
import collections

from sklearn import model_selection
from sklearn.feature_extraction import DictVectorizer

## Load Wikipedia pages data

In [2]:
people = pd.read_csv("./data/people_wiki.csv")

The data contains three fields per person: a URI linking to the article, the person's name, and the article text.

In [3]:
people.head()

| | URI | name | text |
|---|---|---|---|
| 0 | <http://dbpedia.org/resource/Digby_Morrell> | Digby Morrell | digby morrell born 10 october 1979 is a former... |
| 1 | <http://dbpedia.org/resource/Alfred_J._Lewy> | Alfred J. Lewy | alfred j lewy aka sandy lewy graduated from un... |
| 2 | <http://dbpedia.org/resource/Harpdog_Brown> | Harpdog Brown | harpdog brown is a singer and harmonica player... |
| 3 | <http://dbpedia.org/resource/Franz_Rottensteiner> | Franz Rottensteiner | franz rottensteiner born in waidmannsfeld lowe... |
| 4 | <http://dbpedia.org/resource/G-Enka> | G-Enka | henry krvits born 30 december 1974 in tallinn ... |


In [4]:
len(people)

59071

## Data Exploration

### Exploring the entry for President _Obama_

In [5]:
obama = people[people['name'] == 'Barack Obama']
obama

| | URI | name | text |
|---|---|---|---|
| 35817 | <http://dbpedia.org/resource/Barack_Obama> | Barack Obama | barack hussein obama ii brk husen bm born augu... |


### Exploring the entry for actor _George Clooney_

In [6]:
clooney = people[people['name'] == 'George Clooney']
clooney['text']

38514    george timothy clooney born may 6 1961 is an a...
Name: text, dtype: object

### Get the word counts for people

In [7]:
people['word_count'] = people.text.apply(lambda x: dict(collections.Counter(x.split())))
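
To make the word-count step concrete, here is a minimal sketch of what the lambda does to a single piece of text. The sentence is a made-up stand-in for one row of `people['text']`, not data from the corpus.

```python
import collections

# Toy text standing in for one row of people['text'] (hypothetical example).
text = "the coach said the team won the final"

# Counter tallies whitespace-separated tokens; dict() turns it into a plain dictionary.
word_count = dict(collections.Counter(text.split()))

print(word_count["the"])   # number of times "the" occurs
print(len(word_count))     # number of distinct tokens
```

Because the article text is already lowercased and stripped of punctuation, a plain `split()` on whitespace is enough to tokenize it here.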

In [8]:
people.head()

| | URI | name | text | word_count |
|---|---|---|---|---|
| 0 | <http://dbpedia.org/resource/Digby_Morrell> | Digby Morrell | digby morrell born 10 october 1979 is a former... | {'afl': 1, 'round': 1, 'coach': 2, 'aflfrom': ... |
| 1 | <http://dbpedia.org/resource/Alfred_J._Lewy> | Alfred J. Lewy | alfred j lewy aka sandy lewy graduated from un... | {'secretion': 1, 'lewy': 3, 'he': 4, 'an': 2, ... |
| 2 | <http://dbpedia.org/resource/Harpdog_Brown> | Harpdog Brown | harpdog brown is a singer and harmonica player... | {'2010': 1, 'he': 5, 'while': 1, 'bluesgospel'... |
| 3 | <http://dbpedia.org/resource/Franz_Rottensteiner> | Franz Rottensteiner | franz rottensteiner born in waidmannsfeld lowe... | {'anthologies': 1, 'critical': 1, 'institut': ... |
| 4 | <http://dbpedia.org/resource/G-Enka> | G-Enka | henry krvits born 30 december 1974 in tallinn ... | {'pankrot': 1, 'krvits': 1, 'palmisaar': 1, 'f... |


In [9]:
obama = people[people['name'] == 'Barack Obama']
idx = obama.index.tolist()
idx

[35817]

### Sort the word counts for the Obama article

#### Turning the dictionary of word counts into a table

In [10]:
# Look up the row by the index found above instead of hard-coding 35817.
word_count = people.loc[idx[0], 'word_count']

In [11]:
obama_word_count = pd.DataFrame.from_dict(word_count, orient="index")

#### Sorting the word counts in descending order

In [12]:
obama_word_count.sort_values(by=0, ascending=False).head(10)

| | 0 |
|---|---|
| the | 40 |
| in | 30 |
| and | 21 |
| of | 18 |
| to | 14 |
| his | 11 |
| obama | 9 |
| act | 8 |
| a | 7 |
| he | 7 |


The most common words are mostly uninformative function words such as _**the**_, _**in**_, and _**and**_.

# Compute TF-IDF for the corpus 

To give more weight to informative words, we replace the raw counts with TF-IDF scores, which downweight words that appear in many articles.
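
One possible way to compute TF-IDF from word-count dictionaries is sketched below. It assumes scikit-learn's `DictVectorizer` (already imported above) combined with `TfidfTransformer`; the toy dictionaries are hypothetical stand-ins for `people['word_count']`, and the notebook's own computation may differ.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

# Toy word-count dictionaries standing in for people['word_count'] (hypothetical data).
toy_counts = [
    {"the": 4, "president": 2, "senate": 1},
    {"the": 3, "football": 2, "club": 1},
    {"the": 5, "president": 1, "law": 2},
]

# DictVectorizer maps each dictionary to one row of a sparse count matrix.
vectorizer = DictVectorizer()
count_matrix = vectorizer.fit_transform(toy_counts)

# TfidfTransformer reweights the counts by inverse document frequency.
# Note: scikit-learn smooths the IDF and L2-normalizes each row by default,
# so the values differ from the textbook tf * log(N / df).
tfidf = TfidfTransformer()
tfidf_matrix = tfidf.fit_transform(count_matrix)

# Pair each vocabulary word with its learned IDF weight.
idf = dict(zip(vectorizer.feature_names_, tfidf.idf_))
print(idf)
```

In this toy corpus, "the" appears in every document and receives the smallest IDF weight, while words confined to a single document receive the largest.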

## Examine the TF-IDF for the Obama article

Words with the highest TF-IDF scores are much more informative than the most frequent words.
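
The effect can be illustrated with a small hand-rolled ranking. This sketch uses the unsmoothed textbook formula `count * log(N / df)`, which is an assumption rather than the notebook's actual weighting, and the three toy articles are invented for illustration.

```python
import math
import pandas as pd

# Toy word counts for three articles (hypothetical data, not the real corpus).
toy_counts = {
    "politician": {"the": 10, "in": 6, "president": 3, "senate": 2},
    "footballer": {"the": 8, "in": 5, "football": 4, "club": 2},
    "actor": {"the": 9, "in": 4, "film": 3, "president": 1},
}

n_docs = len(toy_counts)

# Document frequency: the number of articles in which each word appears.
doc_freq = {}
for counts in toy_counts.values():
    for word in counts:
        doc_freq[word] = doc_freq.get(word, 0) + 1

def tf_idf(counts):
    """Unsmoothed TF-IDF: count * log(N / document frequency)."""
    return {w: c * math.log(n_docs / doc_freq[w]) for w, c in counts.items()}

# Rank the words of one article, highest TF-IDF first.
politician_tfidf = pd.Series(tf_idf(toy_counts["politician"])).sort_values(ascending=False)
print(politician_tfidf)
```

Words that occur in every toy article, such as "the" and "in", score exactly zero under this formula, so the rarer topical words rise to the top of the ranking.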

# Manually compute distances between a few people

Let's manually compare the distances between the articles for a few famous people.  

In [None]:
clinton = people[people['name'] == 'Bill Clinton']

In [None]:
beckham = people[people['name'] == 'David Beckham']

## Is Obama closer to Clinton than to Beckham?

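We will use cosine distance, which is one minus the cosine similarity of the two articles' TF-IDF vectors $x$ and $y$:

$$d(x, y) = 1 - \frac{x \cdot y}{\lVert x \rVert \, \lVert y \rVert}$$

With this measure, the article about President Obama is closer to the one about former President Clinton than to the one about footballer David Beckham.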
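
A minimal sketch of the cosine distance computation follows. The three vectors are hypothetical TF-IDF vectors over a tiny shared vocabulary, chosen only to show how the measure behaves; they are not derived from the real articles.

```python
import numpy as np

def cosine_distance(x, y):
    """Return 1 - cosine similarity for two non-zero vectors."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return 1.0 - np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

# Hypothetical TF-IDF vectors over the vocabulary [president, senate, football].
politician_a = [3.0, 2.0, 0.0]
politician_b = [2.0, 1.0, 0.0]
footballer = [0.0, 0.0, 4.0]

print(cosine_distance(politician_a, politician_b))
print(cosine_distance(politician_a, footballer))
```

Cosine distance depends only on the direction of the vectors, not their length, so a long article and a short article on the same topic can still be close.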

# Build a nearest neighbor model for document retrieval

We now create a nearest-neighbors model and apply it to document retrieval.  
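
The notebook does not show how the model is built, so the following is only a sketch under the assumption that scikit-learn's `NearestNeighbors` with cosine distance is an acceptable stand-in. The names and word counts are hypothetical toy data.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.neighbors import NearestNeighbors

# Toy corpus: names and word counts are hypothetical stand-ins for the real data.
names = ["politician_a", "politician_b", "footballer", "actor"]
toy_counts = [
    {"the": 5, "president": 3, "senate": 2},
    {"the": 4, "president": 2, "senate": 1, "law": 1},
    {"the": 6, "football": 4, "club": 2},
    {"the": 3, "film": 4, "award": 1},
]

# Word counts -> sparse count matrix -> TF-IDF matrix.
count_matrix = DictVectorizer().fit_transform(toy_counts)
tfidf_matrix = TfidfTransformer().fit_transform(count_matrix)

# Brute-force search with cosine distance works directly on sparse input.
knn = NearestNeighbors(n_neighbors=2, metric="cosine", algorithm="brute")
knn.fit(tfidf_matrix)

# Query with the first article; the closest hit is the article itself.
distances, indices = knn.kneighbors(tfidf_matrix[0:1])
nearest = [names[i] for i in indices[0]]
print(nearest, distances[0])
```

Since the query article is part of the fitted data, the first neighbor returned is the article itself at distance zero, and the second neighbor is the closest other article.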

# Applying the nearest-neighbors model for retrieval

## Who is closest to Obama?

As we can see, President Obama's article is closest to the one about his vice president, Biden, and to those of other politicians.

## Other examples of document retrieval

In [None]:
jolie = people[people['name'] == 'Angelina Jolie']

In [None]:
# Assumes knn_model has already been built and fitted; its construction is not shown here.
knn_model.query(jolie)

In [None]:
arnold = people[people['name'] == 'Arnold Schwarzenegger']

In [None]:
knn_model.query(arnold)