# Layout

+ Opening
+ Packages
    + scikit-learn
    + gensim
+ Introduce ML
+ Training
    + Use sklearn dataset examples
+ Extraction
    + Probabilistic models
    + Deterministic models
    + Stop lists
    + Cleaning
    + Performance
    + [sklearn](http://scikit-learn.org/stable/modules/classes.html#module-sklearn.cluster)
        + [Tfidf](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html)
        + [KMeans](http://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html)
        + Hierarchical
    + [gensim](http://radimrehurek.com/gensim/apiref.html)
        + [LDA](https://radimrehurek.com/gensim/models/ldamodel.html)
            + Expansions on LDA
        + [word2vec](https://radimrehurek.com/gensim/models/word2vec.html)
        + Doc2Vec
    + Implement our own models
        + Regressions

# Week 3 - Clustering

Intro stuff ...

For this notebook we will be using the following packages:

In [1]:
import sklearn
import sklearn.feature_extraction.text
import sklearn.pipeline
import sklearn.preprocessing
import sklearn.datasets
import sklearn.cluster
import sklearn.metrics  # used below for the clustering scores

import gensim
import nltk
import numpy as np
import pandas as pd
import metaknowledge as mk

import time

We can get a dataset to work on from sklearn, here the 20 newsgroups corpus:

In [2]:
# The data_home argument lets you change the download location

newsgroups = sklearn.datasets.fetch_20newsgroups(subset='train')
print(dir(newsgroups))

['DESCR', 'data', 'description', 'filenames', 'target', 'target_names']


We can get the categories with `target_names` or the paths of the actual files with `filenames`:

In [3]:
print(newsgroups.target_names)
print(len(newsgroups.data))

['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 'sci.space', 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc']
11314


Let's reduce our dataset for this analysis and drop some of the extraneous information:

In [4]:
categories = ['comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos']
newsgroups = sklearn.datasets.fetch_20newsgroups(subset='train', categories = categories, remove=['headers', 'footers', 'quotes'])

The text of each post is stored in `data`:

In [5]:
print(len(newsgroups.data))
print("\n".join(newsgroups.data[2].split("\n")[:15]))

2350
Looking for a VIDEO in and OUT Video card for the IBM.  One that will
allow you to watch TV (coax) or video IN, and will do Video out,
digitize pictures.  and if I am in Windows, and would like to be able to
look the RCA out for the card to my TV and have it display on there, as
well as DOS apps.

I heard of these SNES and Genesis copiers, that will copy any games, are
those for real?
                                                                                                                            


In [6]:
count_vect = sklearn.feature_extraction.text.CountVectorizer()
X_train_counts = count_vect.fit_transform(newsgroups.data)
print(X_train_counts.shape)
print(count_vect.vocabulary_.get('algorithm'))

(2350, 23525)
3121
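
To make the shape `(2350, 23525)` concrete, here is a minimal plain-Python sketch of what a count vectorizer does: tokenize, build a sorted vocabulary, and count terms per document. The three `docs` are made up for illustration, and the regular expression is meant to mirror `CountVectorizer`'s default token pattern (lowercased tokens of two or more word characters). It is a sketch, not sklearn's implementation.

```python
import re
from collections import Counter

# Hypothetical toy corpus, not taken from the newsgroups data
docs = ["The car is red", "The red car is for sale", "Mac for sale"]

# Assumed to mirror CountVectorizer's default: tokens of 2+ word characters
TOKEN = re.compile(r"\b\w\w+\b")

def tokenize(text):
    """Lowercase the text and split it into tokens."""
    return TOKEN.findall(text.lower())

# Map each distinct term to a column index, in sorted order
all_terms = sorted({tok for doc in docs for tok in tokenize(doc)})
vocabulary = {term: idx for idx, term in enumerate(all_terms)}

def count_row(doc):
    """Return one row of the document-term matrix as a dense list."""
    row = [0] * len(vocabulary)
    for term, n in Counter(tokenize(doc)).items():
        row[vocabulary[term]] = n
    return row

matrix = [count_row(doc) for doc in docs]

print(vocabulary)
for row in matrix:
    print(row)
```

Each row is one document and each column one vocabulary term, which is why the real matrix has one row per post and one column per distinct token.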


In [7]:
tf_transformer = sklearn.feature_extraction.text.TfidfTransformer(use_idf=True).fit(X_train_counts)
X_train_tf = tf_transformer.transform(X_train_counts)
print(X_train_tf.shape)

(2350, 23525)


In [8]:
tfidf_transformer = sklearn.feature_extraction.text.TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
print(X_train_tfidf.shape)

(2350, 23525)
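
Both cells above produce tf-idf weights, since `use_idf=True` is already the default. As a rough sketch of the arithmetic, assuming sklearn's default smoothed formula idf(t) = ln((1 + n) / (1 + df(t))) + 1 followed by L2 normalisation of each row, the weighting can be written out by hand. The tiny `counts` matrix and `terms` list are hypothetical.

```python
import math

# Hypothetical 3-document count matrix; columns follow `terms`
terms = ["car", "mac", "sale", "the"]
counts = [
    [2, 0, 0, 1],
    [1, 0, 1, 1],
    [0, 1, 1, 1],
]

n_docs = len(counts)

# Document frequency: number of documents containing each term
df = [sum(1 for row in counts if row[j] > 0) for j in range(len(terms))]

# Smoothed inverse document frequency (assumed sklearn default)
idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]

def tfidf_row(row):
    """Weight raw counts by idf, then scale the row to unit L2 length."""
    weighted = [c * w for c, w in zip(row, idf)]
    norm = math.sqrt(sum(x * x for x in weighted))
    return [x / norm if norm else 0.0 for x in weighted]

tfidf = [tfidf_row(row) for row in counts]

for term, weight in zip(terms, idf):
    print("{:5s} idf = {:.3f}".format(term, weight))
```

A term that appears in every document ("the") gets the smallest idf, so within the third document the rarer "mac" ends up with a larger weight than "the" even though both occur once.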


In [9]:
list(zip(count_vect.vocabulary_.keys(), X_train_tfidf.data))[:15]

[('liabilities', 0.21420446505887239),
 ('swimwear', 0.13747920830953553),
 ('mercs', 0.088365351304721923),
 ('intrinsicsp', 0.091564463286724435),
 ('tia', 0.041290324202552547),
 ('tackling', 0.06544181933636592),
 ('0040000d', 0.1414485456965873),
 ('260', 0.16071862845194229),
 ('wpd', 0.096273767261683074),
 ('sabotage', 0.072904541887069643),
 ('bailey', 0.1850416732187698),
 ('diaphram', 0.12929531261824911),
 ('rpk105', 0.087311129052314446),
 ('suit', 0.21969075879490346),
 ('magnus', 0.12823336632667126)]

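Lots of garbage from unique words and stopwords. Note also that zipping `vocabulary_.keys()` with `X_train_tfidf.data` pairs them by position only, so the weights shown do not belong to the words beside them.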
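
Reading real term/weight pairs out of a sparse matrix needs the column index of each stored value, because `.data` holds only the nonzero entries row by row while `vocabulary_` maps terms to columns. A minimal sketch with a hand-built CSR matrix follows. The `data`, `indices`, and `indptr` values and the four-term `vocabulary` are hypothetical, chosen to mimic scipy's CSR layout.

```python
# Hypothetical toy data: a 2-document, 4-term tf-idf matrix in CSR form
vocabulary = {"sale": 2, "car": 0, "mac": 1, "the": 3}  # term -> column index
data = [0.8, 0.6, 0.6, 0.8]   # nonzero weights, stored row by row
indices = [0, 3, 1, 2]        # column index of each stored weight
indptr = [0, 2, 4]            # row i owns data[indptr[i]:indptr[i + 1]]

# Invert the vocabulary so a column index can be turned back into a term
column_to_term = {col: term for term, col in vocabulary.items()}

def row_weights(row):
    """Return (term, weight) pairs for one document, heaviest first."""
    start, end = indptr[row], indptr[row + 1]
    pairs = [(column_to_term[indices[k]], data[k]) for k in range(start, end)]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)

print(row_weights(0))
print(row_weights(1))

# Positional zipping, by contrast, attaches weights to unrelated terms
print(list(zip(vocabulary.keys(), data)))
```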

In [10]:
vectorizer = sklearn.feature_extraction.text.TfidfVectorizer(max_df=0.5, max_features=10000, min_df=3, stop_words='english', norm='l2', use_idf=True)
X = vectorizer.fit_transform(newsgroups.data)
list(zip(vectorizer.get_feature_names()[3000:3010], X.data[3000:3010]))

[('george', 0.076706593286425179),
 ('german', 0.02664763900591155),
 ('germany', 0.14498027575724973),
 ('gets', 0.032641029874755159),
 ('getting', 0.036245068939312432),
 ('gf', 0.028235479836039856),
 ('gfx', 0.029258280771752555),
 ('gfxbase', 0.034749257578655317),
 ('ghg', 0.022975631270420991),
 ('ghost', 0.032641029874755159)]

In [11]:
true_k = np.unique(newsgroups.target_names).shape[0]
true_k

4

In [12]:
km = sklearn.cluster.KMeans(n_clusters=true_k, init='k-means++', max_iter=100, n_init=1, verbose=1)

print("Clustering sparse data with {}".format(km))
t0 = time.time()
km.fit(X) #km.fit(X_train_tfidf)
print("done in {:0.3f}s".format(time.time() - t0))

print("Homogeneity: {:0.3f}".format(sklearn.metrics.homogeneity_score(newsgroups.target, km.labels_)))
print("Completeness: {:0.3f}".format(sklearn.metrics.completeness_score(newsgroups.target, km.labels_)))
print("V-measure: {:0.3f}".format(sklearn.metrics.v_measure_score(newsgroups.target, km.labels_)))
print("Adjusted Rand-Index: {:.3f}".format(sklearn.metrics.adjusted_rand_score(newsgroups.target, km.labels_)))
# Score the clusters that k-means found (km.labels_), not the true newsgroup labels
print("Silhouette Coefficient: {:0.3f}".format(sklearn.metrics.silhouette_score(X, km.labels_, sample_size=1000)))

Clustering sparse data with KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=100,
    n_clusters=4, n_init=1, n_jobs=1, precompute_distances='auto',
    random_state=None, tol=0.0001, verbose=1)
Initialization complete
Iteration  0, inertia 4397.963
Iteration  1, inertia 2243.047
Iteration  2, inertia 2239.600
Iteration  3, inertia 2236.259
Iteration  4, inertia 2233.050
Iteration  5, inertia 2230.648
Iteration  6, inertia 2229.491
Iteration  7, inertia 2228.807
Iteration  8, inertia 2228.342
Iteration  9, inertia 2228.034
Iteration 10, inertia 2227.781
Iteration 11, inertia 2227.598
Iteration 12, inertia 2227.443
Iteration 13, inertia 2227.328
Iteration 14, inertia 2227.270
Iteration 15, inertia 2227.232
Iteration 16, inertia 2227.218
Iteration 17, inertia 2227.201
Iteration 18, inertia 2227.151
Iteration 19, inertia 2227.125
Iteration 20, inertia 2227.105
Iteration 21, inertia 2227.088
Iteration 22, inertia 2227.076
Iteration 23, inertia 2227.070
Converged at iteratio
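
The verbose log above shows the inertia (the sum of squared distances from each point to its assigned centre) falling on every iteration. A minimal plain-Python sketch of the underlying loop (Lloyd's algorithm) may help show why. It assumes fixed starting centres instead of `k-means++` and dense 2-D points instead of sparse tf-idf rows, and the `points` are made up.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def kmeans(points, centers, max_iter=100):
    """Alternate assignment and update steps until the centres stop moving."""
    centers = list(centers)
    inertia_history = []
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centre
        labels = [
            min(range(len(centers)), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        inertia = sum(squared_distance(p, centers[l]) for p, l in zip(points, labels))
        inertia_history.append(inertia)
        # Update step: each centre moves to the mean of its members
        new_centers = []
        for c in range(len(centers)):
            members = [p for p, l in zip(points, labels) if l == c]
            new_centers.append(mean(members) if members else centers[c])
        if new_centers == centers:
            break
        centers = new_centers
    return labels, centers, inertia_history

# Hypothetical data: two well-separated groups of three points
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]

# Deliberately poor start: both centres inside the first group
labels, centers, history = kmeans(points, centers=[points[0], points[1]])

print(labels)
print(history)
```

Neither step can increase the inertia, so the recorded values never go up, and the loop stops once an update leaves the centres unchanged.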

In [13]:
sklearn.metrics.homogeneity_score??
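
As a rough guide to what these scores measure, homogeneity, completeness, and V-measure can be computed from label entropies: homogeneity is 1 - H(C|K) / H(C), completeness is 1 - H(K|C) / H(K), and V-measure is their harmonic mean, where C are the true classes and K the cluster labels. The sketch below is a plain-Python illustration under those definitions, with hypothetical label lists, and is not sklearn's code.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy H(labels), using natural logarithms."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def conditional_entropy(labels, given):
    """Conditional entropy H(labels | given)."""
    n = len(labels)
    joint = Counter(zip(given, labels))
    given_counts = Counter(given)
    return -sum((c / n) * math.log(c / given_counts[g])
                for (g, _), c in joint.items())

def homogeneity_completeness_v(true_labels, cluster_labels):
    """Return (homogeneity, completeness, V-measure)."""
    h_true = entropy(true_labels)
    h_clusters = entropy(cluster_labels)
    homogeneity = 1.0 if h_true == 0 else (
        1.0 - conditional_entropy(true_labels, cluster_labels) / h_true)
    completeness = 1.0 if h_clusters == 0 else (
        1.0 - conditional_entropy(cluster_labels, true_labels) / h_clusters)
    total = homogeneity + completeness
    v_measure = 0.0 if total == 0 else 2 * homogeneity * completeness / total
    return homogeneity, completeness, v_measure

# Hypothetical labels: the same grouping under different cluster names
perfect = homogeneity_completeness_v([0, 0, 1, 1], [1, 1, 0, 0])

# Every point in its own cluster: pure clusters, but each class is split up
split = homogeneity_completeness_v([0, 0, 1, 1], [0, 1, 2, 3])

print(perfect)
print(split)
```

The second case illustrates why both numbers are reported: over-splitting keeps every cluster pure (high homogeneity) while scattering each class across clusters (low completeness).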

In [14]:
terms = vectorizer.get_feature_names()  # get_feature_names_out() in scikit-learn >= 1.0
print("Top terms per cluster:")
order_centroids = km.cluster_centers_.argsort()[:, ::-1]
for i in range(true_k):
    print("Cluster %d:" % i)
    for ind in order_centroids[i, :10]:
        print(' %s' % terms[ind])
    print('\n')

Top terms per cluster:
Cluster 0:
 00
 sale
 offer
 shipping
 condition
 new
 asking
 interested
 sell
 email


Cluster 1:
 car
 like
 just
 don
 new
 good
 cars
 think
 know
 use


Cluster 2:
 mac
 apple
 drive
 thanks
 know
 does
 monitor
 card
 simms
 use


Cluster 3:
 window
 server
 motif
 application
 program
 widget
 use
 using
 x11r5
 file
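
The `argsort()[:, ::-1]` step in the cell above orders each centroid's columns from largest to smallest weight, so the first ten indices name the terms that best characterise the cluster. A small plain-Python sketch of the same idea, with hypothetical `terms` and `centroids`:

```python
def top_terms(centroid, terms, n=3):
    """Return the n terms with the largest weights in a centroid."""
    order = sorted(range(len(centroid)), key=lambda j: centroid[j], reverse=True)
    return [terms[j] for j in order[:n]]

# Hypothetical vocabulary and cluster centres, for illustration only
terms = ["apple", "car", "mac", "sale", "window"]
centroids = [
    [0.0, 0.7, 0.1, 0.2, 0.0],
    [0.5, 0.0, 0.6, 0.1, 0.05],
]

for i, centroid in enumerate(centroids):
    print("Cluster {}: {}".format(i, top_terms(centroid, terms, n=2)))
```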




# Gensim

Loading abstracts from raw Web of Science (WOS) data:

In [15]:
RC = mk.RecordCollection('../data/imetricrecs.txt')
recsData = {'abstract' : [], 'id' : [], 'authors' : []}
for R in RC:
    if R.get('abstract') is not None:
        recsData['abstract'].append(R.get('abstract'))
        recsData['id'].append(R.id)
        recsData['authors'].append(R.get('authorsFull'))
imetric_abstracts = pd.DataFrame(recsData)
imetric_abstracts[:10]

| | abstract | authors | id |
|---|---|---|---|
| 0 | A bibliometric approach is explored to trackin... | [Moed, Henk F., Halevi, Gali] | WOS:000345136000022 |
| 1 | This paper compared and contrasted patent coun... | [Sung, Hui-Yun, Wang, Chun-Chieh, Chen, Dar-Ze... | WOS:000339379600015 |
| 2 | A Triple Helix (TH) network of bi- and trilate... | [Ivanova, Inga A., Leydesdorff, Loet] | WOS:000335905000018 |
| 3 | A bibliometric analysis was conducted to evalu... | [Tan, Jiang, Fu, Hui-Zhen, Ho, Yuh-Shan] | WOS:000330622600043 |
| 4 | Nowadays, the development of emerging technolo... | [Wang, Xuefeng, Li, Rongrong, Ren, Shiming, Zh... | WOS:000331559800010 |
| 5 | It is examined whether the number (J) of (join... | [Bougrine, Hassan] | WOS:000330622600016 |
| 6 | Scientific co-authorship of African researcher... | [Pouris, Anastassios, Ho, Yuh-Shan] | WOS:000331559800037 |
| 7 | In this article barycenters of the places of p... | [Verleysen, Frederik T., Engels, Tim C. E.] | WOS:000343609900032 |
| 8 | While there is a large body of research analyz... | [Yoshikane, Fuyuki, Suzuki, Takafumi] | WOS:000331559800019 |
| 9 | This paper analyzes the relationship among res... | [Ibanez, Alfonso, Bielza, Concha, Larranaga, P... | WOS:000317746900012 |


Let's tokenize and filter the abstracts a bit:

In [16]:
stoplist = set('for a of the and to in'.split())
def abstractFilter(abString):
    sents = nltk.sent_tokenize(abString)
    texts = [word for sent in sents for word in sent.lower().split() if word not in stoplist]
    return texts

imetric_abstracts['abs'] = imetric_abstracts['abstract'].apply(abstractFilter)
imetric_abstracts[:10]

| | abstract | authors | id | abs |
|---|---|---|---|---|
| 0 | A bibliometric approach is explored to trackin... | [Moed, Henk F., Halevi, Gali] | WOS:000345136000022 | [bibliometric, approach, is, explored, trackin... |
| 1 | This paper compared and contrasted patent coun... | [Sung, Hui-Yun, Wang, Chun-Chieh, Chen, Dar-Ze... | WOS:000339379600015 | [this, paper, compared, contrasted, patent, co... |
| 2 | A Triple Helix (TH) network of bi- and trilate... | [Ivanova, Inga A., Leydesdorff, Loet] | WOS:000335905000018 | [triple, helix, (th), network, bi-, trilateral... |
| 3 | A bibliometric analysis was conducted to evalu... | [Tan, Jiang, Fu, Hui-Zhen, Ho, Yuh-Shan] | WOS:000330622600043 | [bibliometric, analysis, was, conducted, evalu... |
| 4 | Nowadays, the development of emerging technolo... | [Wang, Xuefeng, Li, Rongrong, Ren, Shiming, Zh... | WOS:000331559800010 | [nowadays,, development, emerging, technology,... |
| 5 | It is examined whether the number (J) of (join... | [Bougrine, Hassan] | WOS:000330622600016 | [it, is, examined, whether, number, (j), (join... |
| 6 | Scientific co-authorship of African researcher... | [Pouris, Anastassios, Ho, Yuh-Shan] | WOS:000331559800037 | [scientific, co-authorship, african, researche... |
| 7 | In this article barycenters of the places of p... | [Verleysen, Frederik T., Engels, Tim C. E.] | WOS:000343609900032 | [this, article, barycenters, places, publicati... |
| 8 | While there is a large body of research analyz... | [Yoshikane, Fuyuki, Suzuki, Takafumi] | WOS:000331559800019 | [while, there, is, large, body, research, anal... |
| 9 | This paper analyzes the relationship among res... | [Ibanez, Alfonso, Bielza, Concha, Larranaga, P... | WOS:000317746900012 | [this, paper, analyzes, relationship, among, r... |
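
The `abs` column shows a side effect of splitting on whitespace: punctuation stays attached to tokens (`(th)`, `nowadays,`), so a stop word written as `(in` would slip past the stoplist. One possible alternative, sketched here under the assumption that only lowercase alphanumeric words and internal hyphens should survive, is to tokenize with a regular expression. The `sample` sentence and both helper names are hypothetical.

```python
import re

stoplist = set('for a of the and to in'.split())

# Assumed token shape: runs of letters/digits, optionally joined by hyphens
WORD = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")

def whitespace_filter(text):
    """Whitespace splitting, as in the cell above (without sentence splitting)."""
    return [word for word in text.lower().split() if word not in stoplist]

def regex_filter(text):
    """Regex tokenization that drops surrounding punctuation before filtering."""
    return [tok for tok in WORD.findall(text.lower()) if tok not in stoplist]

# Hypothetical sentence, not taken from the abstracts
sample = "Nowadays, the co-authorship of papers (in print) is common."

print(whitespace_filter(sample))
print(regex_filter(sample))
```

With the regex version, `in` is recognised as a stop word and the trailing comma, brackets, and full stop no longer create spurious vocabulary entries.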


In [17]:
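# First pass joins frequently co-occurring word pairs into single tokens (bigrams)
bigram = gensim.models.Phrases(imetric_abstracts['abs'])
bigrammed = bigram[imetric_abstracts['abs']]
# A second pass over the bigrammed text can then form trigrams
trigram = gensim.models.Phrases(bigrammed)
trigrammed = trigram[bigrammed]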
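
`Phrases` decides which adjacent words to join by scoring each bigram from its co-occurrence count. The sketch below uses the scoring formula from Mikolov et al. (2013), score = (count(a, b) - min_count) * N / (count(a) * count(b)) with N the vocabulary size, which gensim's default scorer is understood to follow. The sentences, `min_count=1`, and `threshold=1.0` are hypothetical values chosen so the toy corpus produces a phrase (gensim's own defaults are larger), and both function names are illustrative.

```python
from collections import Counter

def find_phrases(sentences, min_count=1, threshold=1.0):
    """Score every adjacent word pair and keep those above the threshold."""
    unigrams = Counter(word for sent in sentences for word in sent)
    bigrams = Counter(pair for sent in sentences for pair in zip(sent, sent[1:]))
    vocab_size = len(unigrams)
    phrases = {}
    for (a, b), n_ab in bigrams.items():
        score = (n_ab - min_count) / (unigrams[a] * unigrams[b]) * vocab_size
        if score > threshold:
            phrases[(a, b)] = score
    return phrases

def merge_phrases(sentence, phrases, delimiter="_"):
    """Rewrite a sentence, joining each detected pair into one token."""
    merged = []
    i = 0
    while i < len(sentence):
        if i + 1 < len(sentence) and (sentence[i], sentence[i + 1]) in phrases:
            merged.append(sentence[i] + delimiter + sentence[i + 1])
            i += 2
        else:
            merged.append(sentence[i])
            i += 1
    return merged

# Hypothetical tokenized sentences
sentences = [
    ["citation", "analysis", "is", "useful"],
    ["we", "use", "citation", "analysis"],
    ["citation", "analysis", "of", "journals"],
]

phrases = find_phrases(sentences)
print(phrases)
print(merge_phrases(sentences[1], phrases))
```

Running the detector a second time on the merged output is what lets a pair such as a joined bigram plus a following word become a trigram.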

In [18]:
modelSaveLoc = '../imetricsmodel'

start = time.time()
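# Passing the corpus to the constructor builds the vocabulary and runs an initial round of training
model = gensim.models.Word2Vec(trigrammed, workers=4, batch_words=10000)

# Ten further passes over the corpus; newer gensim releases require
# explicit total_examples and epochs arguments to train()
for iteration in range(10):
    model.train(trigrammed)

# Learned vectors and their words; newer gensim releases expose these under model.wv
vocab_matrix = model.syn0
vocabulary = model.index2word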

model.save(modelSaveLoc)

end = time.time()
print(end - start)

26.405695915222168
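
The vectors a word2vec model learns are usually compared by cosine similarity, so that words used in similar contexts score close to 1. A minimal plain-Python sketch of that comparison follows. The three-dimensional `toy_vectors` are invented for illustration and are not output from the model trained above.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def most_similar(word, vectors, topn=2):
    """Rank every other word by cosine similarity to `word`."""
    target = vectors[word]
    scored = [(other, cosine_similarity(target, vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]

# Hypothetical low-dimensional embeddings
toy_vectors = {
    "solar": [0.9, 0.1, 0.0],
    "renewable": [0.8, 0.2, 0.1],
    "citation": [0.0, 0.1, 0.9],
}

for other, score in most_similar("renewable", toy_vectors):
    print("{:10s} {:.3f}".format(other, score))
```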


In [19]:
model = gensim.models.Word2Vec.load(modelSaveLoc)
model['renewable']

array([-0.3949624 , -0.22495362,  0.15555388,  0.43530717,  0.69420147,
       -0.32774708, -0.18479134,  0.39014307, -0.45391074, -0.05440942,
        0.10095476,  0.27137527, -0.0231375 ,  0.34073639,  0.09912231,
       -0.18628053, -0.21471679,  0.01861936, -0.08713119,  0.23915564,
        0.04913566,  0.31107786,  0.29575941, -0.49446738, -0.14355353,
       -0.03819824, -0.01955821, -0.38507015,  0.01835928,  0.23673563,
        0.19441639,  0.1636242 , -0.09832141,  0.50012702, -0.19998699,
        0.11287316, -0.12851557, -0.13711846,  0.26076308, -0.30599201,
        0.44110763,  0.38901994,  0.13803509,  0.26594684, -0.01821186,
       -0.13151573, -0.02712949, -0.25130689, -0.30537039,  0.08836263,
        0.44266155,  0.0689806 ,  0.22709419, -0.39330652, -0.16798811,
        0.54647827,  0.22472142, -0.08905712, -0.15847112,  0.40746388,
        0.18080944, -0.1040563 ,  0.12385511,  0.60678685,  0.44112048,
       -0.72353733, -0.13739991, -0.4349502 , -0.31172612,  0.13