In [2]:
import pandas as pd
import numpy as np
from scipy.stats import mode

In [3]:
df=pd.read_csv('people_wiki.csv')
df.head()

,URI,name,text
0,<http://dbpedia.org/resource/Digby_Morrell>,Digby Morrell,digby morrell born 10 october 1979 is a former...
1,<http://dbpedia.org/resource/Alfred_J._Lewy>,Alfred J. Lewy,alfred j lewy aka sandy lewy graduated from un...
2,<http://dbpedia.org/resource/Harpdog_Brown>,Harpdog Brown,harpdog brown is a singer and harmonica player...
3,<http://dbpedia.org/resource/Franz_Rottensteiner>,Franz Rottensteiner,franz rottensteiner born in waidmannsfeld lowe...
4,<http://dbpedia.org/resource/G-Enka>,G-Enka,henry krvits born 30 december 1974 in tallinn ...


In [4]:
df.shape

(42786, 3)

We don't expect the URI to tell us anything non-trivial about the content of an article, so we can safely drop this column.

In [5]:
df=df.drop('URI',axis=1)
df.head()

,name,text
0,Digby Morrell,digby morrell born 10 october 1979 is a former...
1,Alfred J. Lewy,alfred j lewy aka sandy lewy graduated from un...
2,Harpdog Brown,harpdog brown is a singer and harmonica player...
3,Franz Rottensteiner,franz rottensteiner born in waidmannsfeld lowe...
4,G-Enka,henry krvits born 30 december 1974 in tallinn ...


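We now compute TF-IDF (term frequency-inverse document frequency) features from the Wikipedia text of each entry in the dataset. This is a standard way to turn documents into numeric vectors for text retrieval and similarity search.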

In [6]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf=TfidfVectorizer()
csr_mat=tfidf.fit_transform(df['text'])

In [7]:
print("shape=",csr_mat.shape)

shape= (42786, 437503)
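
To make the TF-IDF step more concrete, here is a minimal sketch on a three-sentence toy corpus. The sentences and the names `toy_texts`, `toy_tfidf` and `toy_mat` are made up for illustration and are not part of the dataset. With the default settings, a term that occurs in fewer documents should receive a larger inverse document frequency.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, standing in for df['text'].
toy_texts = [
    "the striker scored a goal",
    "the singer recorded an album",
    "the striker signed for a new club",
]

toy_tfidf = TfidfVectorizer()
toy_mat = toy_tfidf.fit_transform(toy_texts)  # sparse matrix, one row per document

vocab = toy_tfidf.vocabulary_  # maps each term to its column index
idf = toy_tfidf.idf_           # one inverse document frequency per term

print("shape =", toy_mat.shape)
# "the" occurs in all three documents, "striker" in two, "goal" in one,
# so their idf values increase in that order.
print(idf[vocab["the"]], idf[vocab["striker"]], idf[vocab["goal"]])
```

Each column corresponds to one distinct term, which is why the full dataset ends up with several hundred thousand columns.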


TF-IDF has turned the texts into a sparse matrix with one row per article (42786) and one column per distinct term (437503 features). We shall reduce the number of features significantly using non-negative matrix factorization (NMF).

In [135]:
from sklearn.decomposition import NMF
model=NMF(n_components=20) 
nmf_features=model.fit_transform(csr_mat)
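
As a rough illustration of what the factorization returns, the sketch below runs NMF on a small random non-negative matrix. The matrix `X` and the settings `init`, `random_state` and `max_iter` are assumptions chosen for this toy example only; the cell above uses the defaults. Fixing `random_state` is a common way to make runs repeatable.

```python
import numpy as np
from sklearn.decomposition import NMF

# Hypothetical non-negative toy matrix: 6 "documents" x 5 "terms".
rng = np.random.default_rng(0)
X = rng.random((6, 5))

toy_model = NMF(n_components=2, init="nndsvda", random_state=0, max_iter=500)
W = toy_model.fit_transform(X)  # document-by-component weights, shape (6, 2)
H = toy_model.components_       # component-by-term weights, shape (2, 5)

print(W.shape, H.shape)
# W @ H approximates X using non-negative factors only.
print("reconstruction error:", np.linalg.norm(X - W @ H))
```

In the notebook, `nmf_features` plays the role of `W`: one row per article and one column per component.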

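To compute similarities between articles, we use the dot product between their vectors in the 20-dimensional NMF feature space. We first normalize each row to unit length, so that the dot product of two rows equals their cosine similarity.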

In [152]:
from sklearn.preprocessing import normalize
norm_features=normalize(nmf_features)
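
The following small check illustrates why the rows are normalized first: once every row has unit length, a plain dot product gives the cosine of the angle between two rows, regardless of their original magnitudes. The vectors `a`, `b` and `c` are hypothetical values picked for this sketch.

```python
import numpy as np
from sklearn.preprocessing import normalize

a = np.array([[3.0, 4.0, 0.0]])
b = np.array([[6.0, 8.0, 0.0]])  # same direction as a, twice the length
c = np.array([[0.0, 0.0, 5.0]])  # orthogonal to a

rows = normalize(np.vstack([a, b, c]))  # default norm='l2', applied per row
sims = rows @ rows[0]                   # dot product of every row with the first row

# Rows pointing the same way score close to 1, orthogonal rows score 0.
print(sims)
```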

Let us collect all the ``names`` included in the dataset. They will serve as the index when we compute similarities below.

In [151]:
names=df['name']
print(names[:6])

0          Digby Morrell
1         Alfred J. Lewy
2          Harpdog Brown
3    Franz Rottensteiner
4                 G-Enka
5          Sam Henderson
Name: name, dtype: object


In [153]:
df2=pd.DataFrame(norm_features,index=names)

In [154]:
df2.head()

name,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19
Digby Morrell,0.02777,0.0,0.084782,0.0,0.0,0.0,0.0,0.179018,0.0,0.002265,0.0,0.001151,0.0,0.0,0.0,0.050885,0.0,0.066558,0.976201,0.0
Alfred J. Lewy,0.383034,0.024262,0.0,0.10811,0.0,0.0,0.0,0.0,0.0,0.0,0.015367,0.0,0.194009,0.101636,0.0,0.0,0.225456,0.018589,0.0,0.861177
Harpdog Brown,0.308762,0.0,0.051626,0.723535,0.020175,0.16608,0.039834,0.488288,0.0,0.040115,0.0,0.0148,0.0,0.0,0.032319,0.130234,0.199188,0.21285,0.068865,0.030649
Franz Rottensteiner,0.196599,0.0,0.0,0.0,0.0,0.0,0.0,0.664114,0.013071,0.0,0.0,0.001577,0.0,0.020166,0.0,0.0,0.641241,0.213193,0.0,0.251158
G-Enka,0.062443,0.0,0.0,0.939263,0.0,0.0,0.0,0.146773,0.0,0.0,0.0,0.0,0.0,0.0,0.250946,0.0,0.0,0.171375,0.0,0.0


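This is precisely what we wanted: one unit-length, 20-dimensional feature vector per person. All that remains is to find the articles whose vectors have the largest dot product with the vector of the article we are interested in.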

In [155]:
article=df2.loc['Franz Rottensteiner'] #take an example

In [157]:
similarities=df2.dot(article)
print(similarities.nlargest())

name
Franz Rottensteiner     1.000000
Richard Kirkham         0.995113
Seppo Telenius          0.994699
C. D. Baker (author)    0.994391
Andrew McNeillie        0.993236
dtype: float64


Franz Rottensteiner is an Austrian publisher and critic in the fields of science fiction and speculative fiction in general. 
Richard Kirkham is an American philosopher. 
Seppo Sakari Telenius is a Finnish writer and historian from Helsinki.
C. D. Baker founded an award-winning business before redirecting his career to write full-time from Pennsylvania.
Andrew McNeillie is a British poet and literary editor.
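
The lookup above can be wrapped in a small helper so that it is easy to repeat for other people. This is only a sketch: the function name `most_similar` and the toy frame `toy` are hypothetical, and it assumes that the query name occurs exactly once in the index. Unlike the cell above, it drops the query itself from the result.

```python
import pandas as pd

def most_similar(features, name, n=5):
    """Return the n rows of `features` with the largest dot product against row `name`."""
    query = features.loc[name]             # assumes `name` is unique in the index
    scores = features.dot(query)           # cosine similarity if rows have unit length
    return scores.drop(name).nlargest(n)   # exclude the query itself

# Hypothetical unit-length feature rows, standing in for df2.
toy = pd.DataFrame(
    [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]],
    index=["A", "B", "C"],
)
result = most_similar(toy, "A", n=2)
print(result)
```

With the notebook's data, the equivalent call would be `most_similar(df2, 'Franz Rottensteiner')`.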