In [1]:
import pandas as pd

%matplotlib inline

In [2]:
people = pd.read_csv("people_wiki.csv")

In [5]:
people.head()

Unnamed: 0,URI,name,text
0,<http://dbpedia.org/resource/Digby_Morrell>,Digby Morrell,digby morrell born 10 october 1979 is a former...
1,<http://dbpedia.org/resource/Alfred_J._Lewy>,Alfred J. Lewy,alfred j lewy aka sandy lewy graduated from un...
2,<http://dbpedia.org/resource/Harpdog_Brown>,Harpdog Brown,harpdog brown is a singer and harmonica player...
3,<http://dbpedia.org/resource/Franz_Rottensteiner>,Franz Rottensteiner,franz rottensteiner born in waidmannsfeld lowe...
4,<http://dbpedia.org/resource/G-Enka>,G-Enka,henry krvits born 30 december 1974 in tallinn ...


In [4]:
people.shape

(59071, 3)

# 1. Compare top words by word count and by TF-IDF

In [25]:
elton = people.query("name == 'Elton John'")
elton_idx = elton.index
elton

Unnamed: 0,URI,name,text
19923,<http://dbpedia.org/resource/Elton_John>,Elton John,sir elton hercules john cbe born reginald kenn...


In [8]:
elton.text.item()

'sir elton hercules john cbe born reginald kenneth dwight 25 march 1947 is an english singer songwriter composer pianist record producer and occasional actor he has worked with lyricist bernie taupin as his songwriter partner since 1967 they have collaborated on more than 30 albums to datein his fivedecade career elton john has sold more than 300 million records making him one of the bestselling music artists in the world he has more than fifty top 40 hits including seven consecutive no 1 us albums 58 billboard top 40 singles 27 top 10 four no 2 and nine no 1 for 31 consecutive years 19702000 he had at least one song in the billboard hot 100 his single something about the way you look tonightcandle in the wind 1997 sold over 33 million copies worldwide and is the bestselling single of all time he has received six grammy awards five brit awards winning two awards for outstanding contribution to music and the first brits icon in 2013 for his lasting impact on british culture an academy a

## 1.1 Calculate the word counts

In [21]:
from sklearn.feature_extraction.text import CountVectorizer

count_vect = CountVectorizer()

term_matrix = count_vect.fit_transform(elton.text)

In [19]:
elton_word_counts = pd.DataFrame(
    data=term_matrix.toarray()[0],
    index=count_vect.get_feature_names_out(),
    columns=['count']
)

In [20]:
elton_word_counts.sort_values(by='count', ascending=False)

Unnamed: 0,count
the,27
in,18
and,15
of,13
has,9
...,...
events,1
fellow,1
fifty,1
fight,1


### What are the 3 words in his article with the highest word counts?

* the
* in
* and
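
The three top words are all function words, which is typical of raw counts. As a rough illustration, the sketch below counts tokens in a made-up sentence with the standard library, using the same two-or-more-character token pattern that `CountVectorizer` applies by default. The `toy_text` string and the tiny `toy_stop_words` set are assumptions for this example only and do not come from the dataset.

```python
import re
from collections import Counter

# Hypothetical toy sentence, not taken from people_wiki.csv
toy_text = "The singer wrote the song in the studio and sang the song in the hall"

# Lowercase and keep tokens of two or more word characters,
# mirroring CountVectorizer's default token pattern
tokens = re.findall(r"(?u)\b\w\w+\b", toy_text.lower())
counts = Counter(tokens)

# Hypothetical, deliberately tiny stop-word list for illustration
toy_stop_words = {"the", "in", "and"}
content_counts = Counter(
    {word: n for word, n in counts.items() if word not in toy_stop_words}
)

top_raw = counts.most_common(1)              # most frequent token overall
top_content = content_counts.most_common(1)  # most frequent non-stop-word
print(top_raw)
print(top_content)
```

Once the function words are filtered out, the most frequent remaining token says something about the content, which is the motivation for moving to TF-IDF.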

## 1.2 Calculate TF-IDF for the Wikipedia people corpus

In [38]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vect = TfidfVectorizer(stop_words='english')

tfidf_term_matrix = tfidf_vect.fit_transform(people.text)

In [39]:
elton_tfidf = pd.DataFrame(
    data=tfidf_term_matrix[elton_idx].toarray()[0],
    index=tfidf_vect.get_feature_names_out(),
    columns=['tfidf']
)

In [40]:
elton_tfidf.sort_values(by='tfidf', ascending=False)

Unnamed: 0,tfidf
billboard,0.220815
john,0.217082
elton,0.212174
furnish,0.208194
songwriters,0.137278
...,...
equestranauts,0.000000
equerry,0.000000
equavalent,0.000000
equatorzipser,0.000000


### What are the 3 words in his article with the highest TF-IDF?

* billboard
* john
* elton
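
TF-IDF favours distinctive words because each term's count is scaled by how rare the term is across the corpus. The following is a minimal sketch of that scaling, assuming scikit-learn's documented default smoothed formula, idf(t) = ln((1 + n) / (1 + df(t))) + 1. The corpus size is the row count reported by `people.shape`; both document frequencies are hypothetical values chosen for illustration, not measured from the data.

```python
import math


def smoothed_idf(n_docs, doc_freq):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1


n_docs = 59071      # number of rows reported by people.shape
df_common = 59071   # hypothetical: a word that appears in every article
df_rare = 50        # hypothetical: a word that appears in only 50 articles

idf_common = smoothed_idf(n_docs, df_common)
idf_rare = smoothed_idf(n_docs, df_rare)
print(idf_common, idf_rare)
```

A word found in every article keeps the minimum weight of 1, while a rare word is weighted several times higher. `TfidfVectorizer` additionally L2-normalizes each row by default, so the values in the table above are not raw count-times-idf products.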

# 2. Measuring distance

In [42]:
from sklearn.metrics.pairwise import cosine_distances

In [49]:
# look up the row index of each person of interest
elton_idx = people.query("name == 'Elton John'").index
victoria_idx = people.query("name == 'Victoria Beckham'").index
paul_idx = people.query("name == 'Paul McCartney'").index

#### What’s the cosine distance between the articles on ‘Elton John’ and ‘Victoria Beckham’? 

In [51]:
cosine_distances(tfidf_term_matrix[elton_idx], tfidf_term_matrix[victoria_idx])[0]

array([0.96592977])

#### What’s the cosine distance between the articles on ‘Elton John’ and ‘Paul McCartney’?

In [52]:
cosine_distances(tfidf_term_matrix[elton_idx], tfidf_term_matrix[paul_idx])[0]

array([0.81008627])

#### Which one of the two is closer to Elton John?
* Recall that a lower cosine distance means the article is closer to the one on ‘Elton John’.

* So **Paul McCartney** (distance 0.810) is closer to ‘Elton John’ than ‘Victoria Beckham’ (distance 0.966).
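
Cosine distance is one minus the cosine of the angle between two vectors. For non-negative TF-IDF vectors it therefore ranges from 0 (same direction) to 1 (no shared terms). The sketch below works this out by hand on three hypothetical weight vectors; the vectors and the helper name `cosine_distance` are assumptions for illustration and are not part of the notebook.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cos(angle between u and v) for two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


# Hypothetical TF-IDF-like weights over a shared 4-word vocabulary
doc_a = [0.5, 0.5, 0.0, 0.0]
doc_b = [0.5, 0.0, 0.5, 0.0]  # shares one word with doc_a
doc_c = [0.0, 0.0, 0.0, 1.0]  # shares no words with doc_a

d_ab = cosine_distance(doc_a, doc_b)
d_ac = cosine_distance(doc_a, doc_c)
print(d_ab, d_ac)
```

The pair that shares a word ends up at distance 0.5 and the pair with no overlap at distance 1. This is why distances near 1, such as the two reported above, indicate articles with little weighted vocabulary in common.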

#### Does this result make sense to you?

* **Yes**, because both Elton John and Paul McCartney are songwriters, whereas Victoria Beckham is an English singer, so the first two articles are likely to share more distinctive vocabulary.

# 3. Building nearest neighbors models with different input features and setting the distance metric

## 3.1 Using word counts as features

In [94]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neighbors import NearestNeighbors
from sklearn.metrics.pairwise import cosine_distances

In [64]:
# calculate the term frequency for the entire corpus
count_vect = CountVectorizer()
count_term_matrix = count_vect.fit_transform(people.text)

In [77]:
# 'cosine' selects scikit-learn's built-in cosine distance, which supports sparse input
nbrs = NearestNeighbors(metric='cosine')

nbrs.fit(count_term_matrix)

NearestNeighbors(metric='cosine')

#### What’s the most similar article, other than itself, to the one on ‘Elton John’ using word count features?

In [78]:
distances, indices = nbrs.kneighbors(count_term_matrix[elton_idx])

people.iloc[indices[0]]

**Answer**: Cliff Richard

#### What’s the most similar article, other than itself, to the one on ‘Victoria Beckham’ using word count features?

In [83]:
distances, indices = nbrs.kneighbors(count_term_matrix[victoria_idx])

people.iloc[indices[0]]

Unnamed: 0,URI,name,text
50411,<http://dbpedia.org/resource/Victoria_Beckham>,Victoria Beckham,victoria caroline beckham ne adams born 17 apr...
669,<http://dbpedia.org/resource/Mary_Fitzgerald_(...,Mary Fitzgerald (artist),mary fitzgerald born 1956 is an irish artist w...
45129,<http://dbpedia.org/resource/Adrienne_Corri>,Adrienne Corri,adrienne corri born 13 november 1931 glasgow s...
39504,<http://dbpedia.org/resource/Beverly_Jane_Fry>,Beverly Jane Fry,beverly jane fry is an australian ballerina bo...
13937,<http://dbpedia.org/resource/Raman_Mundair>,Raman Mundair,raman mundair is a british poet writer artist ...


**Answer**: Mary Fitzgerald
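
Each query returns the queried article itself as the first hit, so the answer is read from the second row. The sketch below shows that pattern on a hypothetical three-document corpus, assuming the string metric `"cosine"` with brute-force search. The documents and the `toy_` names are made up for illustration and are unrelated to `people_wiki.csv`.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neighbors import NearestNeighbors

# Hypothetical toy corpus, not taken from people_wiki.csv
toy_docs = [
    "singer songwriter pianist record producer",    # 0: the query
    "singer songwriter guitarist record producer",  # 1: shares four words with 0
    "footballer midfielder club captain",           # 2: shares no words with 0
]

toy_matrix = CountVectorizer().fit_transform(toy_docs)

# Brute-force search with the built-in cosine metric accepts sparse matrices
toy_nbrs = NearestNeighbors(n_neighbors=2, metric="cosine", algorithm="brute")
toy_nbrs.fit(toy_matrix)
toy_distances, toy_indices = toy_nbrs.kneighbors(toy_matrix[0])

# The first hit is the query itself (distance 0), so take the second one
nearest_other = int(toy_indices[0][1])
nearest_other_distance = float(toy_distances[0][1])
print(nearest_other, nearest_other_distance)
```

Document 1 is returned as the nearest other document because it shares four of its five words with the query.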

## 3.2 Using TF-IDF as features

In [85]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors
from sklearn.metrics.pairwise import cosine_distances

In [86]:
# calculate the tf-idf for the entire corpus
tfidf_vect = TfidfVectorizer()

tfidf_term_matrix = tfidf_vect.fit_transform(people.text)

In [88]:
# 'cosine' selects scikit-learn's built-in cosine distance, which supports sparse input
nbrs = NearestNeighbors(metric='cosine')

nbrs.fit(tfidf_term_matrix)

NearestNeighbors(metric='cosine')

#### What’s the most similar article, other than itself, to the one on ‘Elton John’ using TF-IDF features?

In [89]:
distances, indices = nbrs.kneighbors(tfidf_term_matrix[elton_idx])

people.iloc[indices[0]]

**Answer**: Rod Stewart

#### What’s the most similar article, other than itself, to the one on ‘Victoria Beckham’ using TF-IDF features?

In [92]:
distances, indices = nbrs.kneighbors(tfidf_term_matrix[victoria_idx])

people.iloc[indices[0]]

**Answer**: David Beckham