### Document Clustering
This is the new replacement script.

Since downloading a Wikipedia dump, I've started my code over from scratch. `reformat.py` contains the code that preprocesses the output of the wikiextractor I used.

This notebook will open a few sample JSON files and attempt to cluster them.

In [3]:
import json
import glob

In [4]:
wiki_articles = []
for path in glob.glob('data/wiki*.json'):
    # Use a context manager so each file handle is closed after reading.
    with open(path) as f:
        wiki_articles += json.load(f)['articles']
wiki_articles[0].keys()

dict_keys(['text', 'id', 'url', 'title'])

In [5]:
for article in wiki_articles[:5]:
    for key, val in article.items():
        if key != 'text':
            print(key, ':', val)

id : 12
url : https://en.wikipedia.org/wiki?curid=12
title : Anarchism
id : 25
url : https://en.wikipedia.org/wiki?curid=25
title : Autism
id : 39
url : https://en.wikipedia.org/wiki?curid=39
title : Albedo
id : 290
url : https://en.wikipedia.org/wiki?curid=290
title : A
id : 303
url : https://en.wikipedia.org/wiki?curid=303
title : Alabama


Looks like my JSON conversions worked! The only problem now is finding articles that are actually related to one another, so I can effectively test my clustering algorithm. I don't have my data pre-organized by topic...

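We'll try K-means, with Jaccard and KL divergence as candidate distance measures. The first pass below uses TF-IDF features and computes cosine distances.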
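For reference, here is a minimal sketch of the two measures on plain token lists. The function names, the `eps` smoothing constant, and the two sample sentences are my own assumptions for illustration, not part of `reformat.py`. Note that scikit-learn's `KMeans` does not accept a custom metric, so these would have to be plugged into a different clustering routine.

```python
import math
from collections import Counter


def jaccard_distance(tokens_a, tokens_b):
    """1 - |A intersect B| / |A union B| over the distinct tokens of each document."""
    a, b = set(tokens_a), set(tokens_b)
    if not a and not b:
        return 0.0  # two empty documents are treated as identical
    return 1.0 - len(a & b) / len(a | b)


def kl_divergence(tokens_p, tokens_q, eps=1e-9):
    """KL(P || Q) between unigram distributions, smoothed by eps (an assumed value)."""
    counts_p, counts_q = Counter(tokens_p), Counter(tokens_q)
    vocab = sorted(set(counts_p) | set(counts_q))
    # Add eps to every count so no probability is exactly zero.
    p = [counts_p[t] + eps for t in vocab]
    q = [counts_q[t] + eps for t in vocab]
    p_total, q_total = sum(p), sum(q)
    return sum((pi / p_total) * math.log((pi / p_total) / (qi / q_total))
               for pi, qi in zip(p, q))


# Made-up sentences standing in for article text.
doc_a = 'anarchism is a political philosophy'.split()
doc_b = 'autism is a developmental condition'.split()

jd = jaccard_distance(doc_a, doc_b)
kl_ab = kl_divergence(doc_a, doc_b)
print('Jaccard distance:', jd)
print('KL(a || b):', kl_ab)
```

Keep in mind that KL divergence is not symmetric in general, so `kl_divergence(a, b)` and `kl_divergence(b, a)` can differ, which matters if it is used as a distance.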

In [7]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(max_df=0.8, max_features=20000,
                        min_df=0.2, stop_words='english',
                        use_idf=True, ngram_range=(1, 3))
wiki_articles_text = [x['text'] for x in wiki_articles]
tfidf_vector = tfidf.fit_transform(wiki_articles_text)
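To get a feel for what `min_df` and `max_df` do to the vocabulary, here is a small sketch on a made-up four-document corpus. The corpus and the `min_df=0.5` threshold are assumptions chosen only so the effect is visible at this size; the real cell above uses `min_df=0.2`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up corpus: 'cat' is in all 4 documents, 'dog' and 'bird' in 2 each,
# 'fish' and 'horse' in 1 each.
toy_docs = ['cat dog bird', 'cat dog fish', 'cat bird', 'cat horse']

# Float thresholds are fractions of the corpus: drop terms that appear in
# more than 80% of documents or in fewer than 50% of them.
toy_tfidf = TfidfVectorizer(max_df=0.8, min_df=0.5)
toy_matrix = toy_tfidf.fit_transform(toy_docs)

kept = sorted(toy_tfidf.vocabulary_)
print('kept terms:', kept)
print('matrix shape:', toy_matrix.shape)
```

Terms that are too common ('cat') or too rare ('fish', 'horse') are dropped, so a document made up only of such terms ends up as an all-zero row.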

In [9]:
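# jaccard_similarity_score was removed in scikit-learn 0.23; jaccard_score replaces it.
from sklearn.metrics import jaccard_score
from sklearn.metrics.pairwise import cosine_similarity
# Pairwise cosine distance between articles (0 means identical direction in TF-IDF space).
dist = 1 - cosine_similarity(tfidf_vector)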
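One way a distance matrix like `dist` could be used is to look up each article's nearest neighbour. The sketch below does this on a tiny made-up matrix; `X_demo`, `dist_demo`, and `nearest` are hypothetical names, and the vectors are not real TF-IDF output.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Three made-up document vectors: rows 0 and 1 point in almost the same direction.
X_demo = np.array([[1.0, 0.0, 0.0],
                   [0.9, 0.1, 0.0],
                   [0.0, 1.0, 0.0]])

dist_demo = 1 - cosine_similarity(X_demo)

# Ignore each document's distance to itself before taking the minimum.
np.fill_diagonal(dist_demo, np.inf)
nearest = dist_demo.argmin(axis=1)

for i, j in enumerate(nearest):
    print('doc', i, 'is closest to doc', j)
```

On the wiki data, the same lookup with `wiki_articles[j]['title']` would give a quick sanity check on whether related articles actually end up close together.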


In [12]:
from sklearn.cluster import KMeans
num_clusters = 5

km = KMeans(n_clusters=num_clusters)

%time km.fit(tfidf_vector)

clusters = km.labels_.tolist()

CPU times: user 3.72 s, sys: 0 ns, total: 3.72 s
Wall time: 3.74 s
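Since I have no topic labels, one rough way to see what each cluster is about is to read off the highest-weighted terms of each centroid. This is a sketch on a made-up two-topic corpus, not the wiki data; `km_demo`, `terms`, and `top_terms` are hypothetical names, and `random_state` and `n_init` are fixed only to make the sketch repeatable.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up corpus with two obviously different topics.
docs = [
    'apple banana fruit', 'banana fruit apple juice', 'fruit apple',
    'engine car wheel', 'car wheel engine road', 'wheel car',
]

vec = TfidfVectorizer()
X = vec.fit_transform(docs)

km_demo = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Column index -> term, recovered from the fitted vocabulary.
terms = sorted(vec.vocabulary_, key=vec.vocabulary_.get)

# For each centroid, sort term weights in descending order and keep the top 3.
top_terms = {}
for label, centroid in enumerate(km_demo.cluster_centers_):
    top_idx = centroid.argsort()[::-1][:3]
    top_terms[label] = [terms[i] for i in top_idx]
    print(label, ':', ', '.join(top_terms[label]))
```

The same loop should work with `km` and `tfidf` from the cells above, although with `ngram_range=(1,3)` the "terms" will include two- and three-word phrases.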


In [13]:
import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23
joblib.dump(km, 'doc_cluster.pkl')

km = joblib.load('doc_cluster.pkl')
clusters = km.labels_.tolist()
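With `clusters` in hand, the next natural step is to group article titles by label and eyeball them. A minimal sketch follows; the label list is made up for illustration and is not real K-means output, and in the notebook the two inputs would be `[a['title'] for a in wiki_articles]` and `km.labels_.tolist()`.

```python
from collections import defaultdict

# Hypothetical stand-ins for the notebook's titles and cluster labels.
titles = ['Anarchism', 'Autism', 'Albedo', 'A', 'Alabama']
demo_clusters = [0, 1, 2, 0, 1]  # made-up labels, one per title

# Collect titles under their cluster label, preserving article order.
titles_by_cluster = defaultdict(list)
for title, label in zip(titles, demo_clusters):
    titles_by_cluster[label].append(title)

for label in sorted(titles_by_cluster):
    print(label, ':', ', '.join(titles_by_cluster[label]))
```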