<a class="anchor" id="first-part"></a>

#!/usr/bin/env python<br># coding: utf-8

Author: Bao Cai

Course: Machine Learning for Descriptive Problems

Topic: NLP-Unsupervised

Start Date: 2020-03-11

Last Save: 2020-03-11

[1. Topic analysis](#topic-part)

There are quantitative measures for evaluating various aspects of clustering results and topic models intrinsically, and they can also be evaluated extrinsically by how well the clusters/topics serve some supervised task as features. In this exervise, however, we well fovus on qualitative evaluation of the results in terms of their descriptiveness. As in the example code, you may limit yourself to the 1000 first documents of ther corpus when performing clustering, in order to simplify the task and speed up experimentation, but use the whole corpus to calculate tf-idf features.

a. Experiment with different setups of the tf-idf feature extraction and clustering (k-means or hierarchical). In order to obtain meaningful results. When you arrive at a good configuration, describe it and motivate your chosen setup/parameters.

b. Inspect the keywords of the clusters. List the 10 first clusters out of all (o.e. not cherry picked examples) and privde an as descriptive label as possible for each of them.

c. Select one or two good clusters (that can be clearly interpreted) and one or two bad clusters (that might be difficult to interpret or distinguish). Motivate your choise (clusters may for instance, be overlapping, to broad/narrow or incoherent).

d. Repeat the experiment in (a) with LDA topic modelling instead (on the whole corpus), and explain briefly how the results compare to your previously chosen clustering setup. A few concrete examples may be helpful. Do your best to make sure the list of topic keywords are informative through appropriate post-processing.

[2. Word vectors](#word-vector)

a. Choose about 5 words (arbitrary) to use as seed words in the following experiment. Train word2vec vectors on the corpus while trying out variations on the parameters. Evaluate the vector models by inspecting the most similar words for each of the seed words, and try to identify qualitative differences between different parameter choices. Which parameters seem to have the most interesting effect? At what values? Motivate. Finally, study the qualitative effect of increasing the training data, by similarly comparing vectors trained with the best setup on the texts from the awards_2020 directory against vectors trained on the whole set of abstracts (1990-2002).

b. Repeat the experiment with ELMo from the lecture, with a different target word and diferent sentences. Choose a word that can have multiple senses, and construct 10 sentences that express 2-3 different senses of the word. Produce ELMo embeddings for the target word in each sentence and measure the similarity between the vectors. Evaluate in how many cases the measured similarities can be used to successfully distinguish between the different senses. Comment on the results, e.g. are you able to identify a particular way in which the model fails?

[top](#first-part)

In [28]:
import numpy
import os
import re
import binascii
import itertools
import heapq
import numpy as np
import pandas as pd
from time import time
from collections import Counter
from scipy.spatial.distance import cosine
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from scipy.cluster.hierarchy import dendrogram, linkage, fcluster

In [8]:
# Functions
def get_fnames(path='./Data'):
    """Read all text files in a folder.
    """
    fnames = []
    for root, _, files in os.walk(path):
        for fname in files:
            if fname[-4:] == '.txt':
                fnames.append(os.path.join(root, fname))
    return fnames

def read_file(fname):
    with open(fname, 'rt', encoding='latin-1') as f:
        # skip all lines until abstract
        for line in f:
            if "Abstract    :" in line:
                break

        # get abstract as a single string
        abstract = ' '.join([line[:-1].strip() for line in f])
        abstract = re.sub(' +', ' ', abstract)  # remove double spaces
        return abstract

In [9]:
documents = [read_file(file) for file in get_fnames('./Data/awards_2002/')]

In [14]:
# Set parameters and initialize
tfidf_vectorizer = TfidfVectorizer(min_df=2, use_idf=True, sublinear_tf=True, max_df=1.0, max_features=20000)
# Tip: the vectorizer also supports extracting n-gram features (common short sequences of words),
# which may be more descriptive but also much less frequent

# Calcualate term-document matrix with tf-idf scores
tfidf_matrix = tfidf_vectorizer.fit_transform(documents)

# Check matrix shape
tfidf_matrix.toarray().shape # N_docs x N_terms

(9923, 20000)

In [16]:
terms_in_docs = tfidf_vectorizer.inverse_transform(tfidf_matrix)
token_counter = Counter()
for terms in terms_in_docs:
    token_counter.update(terms)

for term, count in token_counter.most_common(20):
    print("%d\t%s" % (count, term))

9637	the
9619	of
9613	and
9511	to
9442	in
8743	this
8625	for
8269	is
8228	will
7632	be
7419	on
7271	with
7167	that
6656	are
6642	research
6561	by
6424	as
5968	from
5750	an
5402	these


In [18]:
print(sorted(zip(features, tfidf_vectorizer._tfidf.idf_),key=lambda x:x[1])[:20])

[('the', 1.0292424476114135), ('of', 1.0311118011519393), ('and', 1.0317356963577857), ('to', 1.0424019064410739), ('in', 1.0496823395486226), ('this', 1.126588314955317), ('for', 1.1401751676039031), ('is', 1.1823215567939547), ('will', 1.1872915652059788), ('be', 1.2624751130132212), ('on', 1.2907770086502655), ('with', 1.310924708954377), ('that', 1.3253293901569252), ('are', 1.399287133210988), ('research', 1.4013923971464504), ('by', 1.4136606312906448), ('as', 1.434759435048271), ('from', 1.5083766566430419), ('an', 1.5455823130979374), ('these', 1.608001710967626)]


In [17]:
## Inspect top terms per document

features = tfidf_vectorizer.get_feature_names()
for doc_i in range(5):
    print("\nDocument %d, top terms by TF-IDF" % doc_i)
    for term, score in sorted(list(zip(features,tfidf_matrix.toarray()[doc_i])), key=lambda x:-x[1])[:5]:
        print("%.2f\t%s" % (score, term))


Document 0, top terms by TF-IDF
0.34	flower
0.24	color
0.22	mutations
0.21	pollinators
0.21	differences

Document 1, top terms by TF-IDF
0.36	pollinator
0.25	inbreeding
0.22	fragmentation
0.21	correlation
0.20	fragments

Document 2, top terms by TF-IDF
0.32	pulsars
0.23	binaries
0.20	galaxy
0.18	survey
0.18	observatory

Document 3, top terms by TF-IDF
0.26	dogs
0.23	prairie
0.20	captivity
0.20	predators
0.18	reared

Document 4, top terms by TF-IDF
0.32	copulatory
0.32	cannibalism
0.22	males
0.21	reproductive
0.19	mate


In [19]:
print(tfidf_matrix.toarray())

[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]


In [20]:
print("Document vector length:", tfidf_matrix.shape[1])
for i in range(5):
    print("Non-zero dimensions for document %d: %d" % (i, len([x for x in tfidf_matrix.toarray()[i] if x > 0])))

Document vector length: 20000
Non-zero dimensions for document 0: 112
Non-zero dimensions for document 1: 88
Non-zero dimensions for document 2: 95
Non-zero dimensions for document 3: 152
Non-zero dimensions for document 4: 132


In [21]:
print("Sample word:", features[1000])
print("Occurs in %d documents" % len([x for x in tfidf_matrix.toarray()[:][1000] if x > 0]))
print("out of %d documents" % len(tfidf_matrix.toarray()))

Sample word: allyl
Occurs in 97 documents
out of 9923 documents


In [31]:
# matrix_sample = tfidf_matrix[:1000]
matrix_sample = tfidf_matrix
# Do clustering
km = KMeans(n_clusters=30, random_state=123, verbose=0)
km.fit(matrix_sample)

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
       n_clusters=30, n_init=10, n_jobs=None, precompute_distances='auto',
       random_state=123, tol=0.0001, verbose=0)

In [32]:
def print_clusters(matrix, clusters, n_keywords=10):
    for cluster in range(min(clusters), max(clusters)+1):
        cluster_docs = [i for i, c in enumerate(clusters) if c == cluster]
        print("Cluster: %d (%d docs)" % (cluster, len(cluster_docs)))
        
        # Keep scores for top n terms
        new_matrix = np.zeros((len(cluster_docs), matrix.shape[1]))
        for cluster_i, doc_vec in enumerate(matrix[cluster_docs].toarray()):
            for idx, score in heapq.nlargest(n_keywords, enumerate(doc_vec), key=lambda x:x[1]):
                new_matrix[cluster_i][idx] = score

        # Aggregate scores for kept top terms
        keywords = heapq.nlargest(n_keywords, zip(new_matrix.sum(axis=0), features))
        print(', '.join([w for s,w in keywords]))
        print()

In [33]:
km.labels_

array([22,  8, 25, ..., 26,  0, 11], dtype=int32)

In [34]:
print_clusters(matrix_sample, km.labels_)

Cluster: 0 (625 docs)
solar, wind, magnetic, ionosphere, ocean, magnetosphere, oceanographic, iron, auroral, waves

Cluster: 1 (343 docs)
fluid, quantum, flows, turbulence, particles, transport, turbulent, fluids, colloidal, adsorption

Cluster: 2 (491 docs)
center, equipment, igert, manufacturing, ucrc, facility, engineering, station, industry, polymer

Cluster: 3 (367 docs)
workshop, geoscience, geon, eu, workshops, government, federal, committee, earth, cyberinfrastructure

Cluster: 4 (254 docs)
algebraic, theory, algebras, spaces, algebra, quantum, geometry, commutative, conjecture, geometric

Cluster: 5 (330 docs)
conference, symposium, meeting, gordon, speakers, 2002, 2003, young, travel, congress

Cluster: 6 (893 docs)
social, contract, political, firms, children, organizational, language, archaeological, policy, cultural

Cluster: 7 (326 docs)
wireless, networks, network, sensor, mobile, power, routing, qos, nodes, traffic

Cluster: 8 (374 docs)
forest, ecosystem, soil, species

In [None]:
Z = linkage(matrix_sample.todense(), metric='cosine', method='complete')
_ = dendrogram(Z, no_labels=True) # Plot dentrogram chart

In [None]:
## Get flat clusters from cluster hierarchy

#clusters = fcluster(Z, 50, criterion='maxclust') # Create fix number of flat clusters
clusters = fcluster(Z, 0.99, criterion='distance') # Create flat clusters by distance threshold

print_clusters(matrix_sample, clusters)